Clustering tasks consist in arranging a set of elements into groups with homogeneous properties/features that capture some important structure of the whole set. Being the main paradigm of unsupervised learning, clustering has become of great interest due to the considerable increase in the amount of unlabeled data in the recent years. As the characteristics of real-life data-in geometrical and statistical terms- are very diverse, an intensive research effort has been dedicated to define various clustering algorithms which adapt to some particular features and structural properties. See[henning, sklearn] for discussions on the different methods and how to choose among them. In this work, we assume that each cluster can be well represented as a set of points dispersed around a prototype. Each prototype is a summary of the location of each group and could be one of its data points. With this representation assumption, we leave out of our scope toy examples like concentric circles where other families of algorithms have good performance (see for instance [spectral]).
Among the different types of clustering algorithms, we here focus on the so-called model-based family. A well-known multivariate probabilistic model is the Gaussian Mixture Model (GMM), which represents the distribution of the data as a random variable given by a mixture of Gaussian distributions. The corresponding clustering criterion is simple: all points drawn from one of these normal distributions are considered to belong to the same cluster. The parameters of this model can be estimated using the two-step iterative Expectation-Maximization (EM) algorithm[EMalgo], based on the maximization of the likelihood. In particular for the GMM case, closed-form expressions exist for obtaining parameters estimations at the maximization step.
It is known that the EM algorithm can have a really poor performance in the presence of noise [badperf]. This phenomenon can be simply explained by the non-robustness of the estimators that are computed in the algorithm: means and sample covariance matrices [maronna]. In order to improve the performance, two main strategies were developed. One consists in modifying the model to take into account the noise and the other one is to keep the original model and replace the estimators by others that are able to deal with outliers.
Several variations of the Gaussian mixture model have been developed. We will mainly be interested in those variations that target the problem of mixtures of more general distributions, which allow to model a wider range of data, allowing for the presence of noise and outliers. Regarding the use of non-Gaussian distributions, an important model defined as a mixture of Student’s -distributions was proposed in [c2]
. In that paper, the authors developed an algorithm to estimate the parameters of the mixture with known and unknown degrees of freedom by maximizing the likelihood. Furthermore, in[c5]
the authors addressed the task of clustering by mixture models with this particular distribution. More recently, hyperbolic and skew t-distributions were considered in[hyper].
We focus in the present paper on addressing the robustness issue of GMM by proposing a model-based algorithm that provides good clustering performance in the case of more general distribution mixtures. Our method is inspired by the robust applications of the Elliptical Symmetric (ES) distributions [boente, esa]. Elliptical distributions have been widely used in many applications where Gaussian distributions are not good-enough approximations because of the presence of heavy tails or outliers [CFAR, radar]. This family includes, among others, the class of compound-Gaussian distributions [CG, CG2, c1] that contains Gaussian, Student’s and distributions. The method we develop mimics indeed a mixture of distributions within the latter very general class. Thus, our model is intended to be, by construction, more general than the classic GMM and the model proposed in [c2].
Other robust clustering approaches worth mentioning are models which add an extra term to the usual Gaussian likelihood and algorithms with modifications inspired by usual robust techniques as robust point estimators, robust scales, weights for observations and trimming techniques. For instance, [unif] considered the presence of a uniform noise as background while [RIMLE] defined a pseudo-likelihood that filters the low density areas. In [Spatial-EM], the authors replaced the usual mean and sample covariance by the spatial median and the rank covariance matrix (RCM). In [ktau] a robust scale is used to define a -means-like algorithm that can deal with outliers. Moreover, in [weights1, weights2, weighted] different weights for the observations were proposed. Small weights correspond, as usual in the robust literature, to observations that are far from the cluster centers. Finally, the trimming algorithms leave out of the parameter estimation a proportion of data points that are far from all the means [Tclust].
In this paper we present an algorithm with the following features:
it follows the two steps expectation and maximization of EM algorithms,
it derives estimations of clusters centers and covariances which turn out to be robust,
it also estimate one scale parameter per data point, increasing the overall clustering flexibility.
Though the number of parameters gets of the same order as the amount of data, the algorithm does not loose much in efficiency, even for instance when the data is close to Gaussian. On the contrary, it can actually be more efficient for noisy data; gaining flexibility to accommodate for larger tails and outliers.
The induced clustering performance is improved compared to k-means, the EM algorithm and HDBSCAN [HDBSCAN1, HDBSCAN2] when applied to real datasets such as MNIST [MNIST], NORB [NORB] and 20newsgroups [20newsgroup]. Furthermore, our algorithm is able to provide accurate estimations of location and dispersion parameters even in the presence of heavy tails or noise as proved in simulations.
The rest of the paper is organized as follows. In section 2, we present our clustering algorithm and discuss some of its important aspects in detail. Section 3 is devoted to the experimental results, which allow us to show the improved performance of our method for different real datasets in comparison with other commonly used methods such as k-means. Finally, our conclusions are stated in section 4.
Notation:Vectors (resp. matrices) are denoted by boldfaced lowercase letters (resp. uppercase letters). represents the transpose of , represents the determinant of and represents the trace of . stands for “independent and identically distributed”, w.r.t stands for “with respect to” and means “is distributed as”.
2 The Model
In this section, we present a detailed description of the proposed robust clustering algorithm. Given a set of data points in
we start by considering points as samples from a mixture of distributions with the following probability density function (pdf):
where represents the proportion of the distribution in the mixture and its pdf depending on the parameters grouped in .
Some distributions of interest include the Elliptically Symmetric distributions. An -dimensional random vector from the distribution is ES-distributed if its pdf can be written as
where is a constant, is any function (called the density generator) such that (2) defines a pdf, is the location parameter (mean) and is the scatter matrix. The matrix reflects the structure of the covariance matrix of (the covariance matrix is equal to
up to a scale factor if the distribution has a finite second-order moment, see[ollila2012complex] for details). This is denoted .
Interestingly, such modelling admits a Stochastic Representation Theorem. if and only if it admits the following stochastic representation [Yao73]
where the non-negative real random variable , called the modular variate, is independent of the random vector
that is uniformly distributed on the unit-sphere and is a factorization of .
ES distributions include the class of compound Gaussian distributions. This particular family generalizes, in a symmetric way, the Student’s
distributions by multiplying a Gaussian by a random variable representing a radial univariate variance. A random variablefollowing a compound-Gaussian distribution can be written as
where corresponds to the mean, is a positive random variable independent from , and . Generally, some one-dimensional constraints on is assumed for identifiability (between and ) conditions. In this work, we assume .
This class of distributions contains in particular the Gaussian and the heavy-tailed Student’s cases. If the scale variable is discrete constantly equal to 1, then we have a classical Gaussian distribution. In terms of pdf, Normal (Gaussian) distribution is a particular case of ES distributions in which and . Furthermore, when it is a -distribution with degrees of freedom. This class includes other possibilities such as the k-distribution where .
To have some connections between the most general model of ES distributions and the sub-class of compound Gaussian distributions, a Gaussian-core representation of CES has been introduced in [draskovic2018new]. The stochastic representation can be rewritten using the fact that , where . Hence, a random vector can be represented as
with and defined as in Eq. (3). If is independent of , the vector follows a compound-Gaussian distribution.
Because we want to keep the flexibility of this model, we do not set a particular distribution for in Eq. (4) to estimate the cluster distribution in a parametric way. Instead, we get inspiration from the compound-Gaussian family to model each cluster in an approximated way. We assume that each data point , from the cluster of the mixture, is the result of multiplying a sample from a multivariate zero-mean Gaussian by a deterministic constant and a consequent translation, i.e. it can be written as
with the mean a real vector, a deterministic constant or parameter, and with tr. As shown in Figure 1, elliptical confidence regions are setted by different values that flexibilize the cluster membership. This flexibility is reached without affecting the important parameter estimations. Notice that assuming all the in the cluster as unknown deterministic parameters is equivalent to assume the following pdf for :
where is the number of elements in cluster .
2.1 Parameter estimation for one distribution
To have the flavour of the general method, let us first assume that all data points come from only one compound-Gaussian model. In order to fully characterize the distribution, one would have to fix the parameters and and determine , the pdf of . Our approach of considering the
’s as unknown parameters (and hence not dealing with a non-parametric model) comes from the fact that it is actually difficult (and possibly not necessary) to estimate the distribution ofif we assume noise and heavy tails. In the sequel, we hence consider one deterministic for each data point as for instance in [Fred]. In consequence, we need to estimate , and for all with all the samples of this distribution.
If we consider this approximated distribution then with . It is well-known (see e.g. [Anderson84]) that the Maximum-Likelihood estimators for once is fixed, the sample covariance estimator is the maximum likelihood estimator for and are given by
Notice that in such a case, i.e., omitting the particular structure of w.r.t. the ’s, does not depend on index (summation is done over all ’s).
However, as the parameter of interest is instead of , the authors of [Fred] adopt a two-step strategy to alternatively maximize the likelihood function w.r.t. to each parameter. In a first step, one estimates , then each given and , then one estimates given and each . The second step consists in plug-in the resulting estimates. Hereafter are the solutions of the first-step :
Finally, the estimators are given as the solutions of the following equations:
This equations system, and particularly the fixed-point equations for estimating and , are solved with an initial value and iterating until convergence as showed and discussed in [Fred, frontera2014hyperspectral].
2.2 Parameter estimation for the mixture model
For the clustering task, we start with compound-Gaussian distributions, but, importantly, the specific distribution for each cluster is not perfectly known, since the pdf of the ’s is not specified. Similarly to the EM algorithm, we extend the model with discrete variables (with ), that are not observed, representing the cluster for each observation . We compute the label for each observation and cluster in the E-step, while in the M-step we estimate . Given , the expectation of the extended likelihood of is derived from (1) as follows
where, with and .
Given the data points and the extended likelihood of the model as in (2.2), the maximization w.r.t. each parameter of the model contained in keeping the rest of the parameters fixed leads to
for the proportion of each distribution,
for the mean of each distribution,
for the covariance matrices and
for all scale parameters.
We proceed by maximizing the likelihood expectation with respect to the different parameters. Note that there is a constraint on the proportions , which forces us to a Lagrange multiplier. Solving the system of equations composed by
together with the conditions and , we get the expression given in (14).
Taking the derivative with respect to we find
and setting this to zero we obtain
This is exactly as in (15).
Now, in order to estimate we differentiate with respect to and get
Equating this to zero we find (16).
Finally, differentiation with respect to leads to
Then, equating to zero leads to (17).
Finally, we can justify that these critical points are actually local extremes thanks to usual equations derived from the expectation of the Gaussian likelihood.
As follows from the derivation of Proposition 1, the estimator for the parameter is given by
and the fixed-point equations
hold for the rest of the estimators with as in (16).
It is important to notice that the derivation of estimators in our model results in usual robust estimators for the position and dispersion parameters of each distribution. More specifically, both can be assimilated to -estimators with a certain function [maronna]. Actually, the scatter matrix estimator is very close to the corresponding Tyler’s -estimator while the location parameter estimator is close to the corresponding Tyler’s -estimator (up to the exponent parameter of (see [tyler1987distribution, frontera2014hyperspectral] for more details). Main differences arise from the mixture model that leads to different weights involved by the different distributions. This approach can be seen as a generalization of Tyler’s -estimators to the mixture case.
Thus, can be written as
while can be written as
This reaffirms the robust character of our proposal.
The general structure of the proposed algorithm is the same as the one of the classical EM algorithm. The main difference between both algorithms relies in the recursive update equations for the parameter estimations. More precisely, based on equations (20) and (21), we consider four slightly different versions of the algorithm depending on two different aspects:
In the first version, the parameter used to compute the estimator is taken to be the one obtained in the same iteration of the fixed-point loop.
In version 2, the -parameter is instead taken to be the one obtained in the previous iteration.
Based on the similarity with -estimators, version 3 proposes an accelerated method where expressions such as in the denominators of the fixed-point equations are replaced by their square.
Finally, version 4 implements the same acceleration procedure on top of the algorithm of version 2.
For concreteness, the complete algorithm in versions 1 and 4 is shown in Algorithm 1.
In the case of low dimensions, we truncate the value because of numerical problems when cluster points tend to be very close to the mean. That is, if is smaller than a threshold we change its value to a selected threshold.
3 Experimental results
In this section, we present experiments with synthetic and real data. We study the convergence of the fixed point equations and the estimation error in the case of the synthetic data where we know the true parameter values. Additionally, for the real data we compare the clustering results with the ground truth labels for k-means, the classical EM, EM for Student’s
-distributions, HDBSCAN, spectral clustering[spectral] and our robust algorithm. The comparison between the former three and our algorithm is straight-forward because they all have in common only one main parameter (the number of clusters) that we fix and suppose known in our experiments. In the case of spectral clustering, it is necessary to tune an extra parameter in order to build the neighborhood graph. We set the number of neighbors to be considered in the graph as the number that maximizes the silhouette score [silhouettes]. A fair comparison with HDBSCAN is even more difficult to set and we evaluated this algorithm by sweeping a grid of selected values. We then quantify the differences of performance by using the usual metrics for the clustering task known as adjusted mutual information index (AMI) and adjusted rand index (AR). In some cases, we visualize the 2D embedding of the data obtained by the UMAP [UMAP] algorithm colored with the resulting labels of the different clustering algorithms. This dimensional reduction algorithm has the same goal than t-SNE [tSNE] but every decision in the design is justified by topological theory.
3.1 Synthetic data
In order to compare the clustering performance of the different algorithms we simulate settings with different distributions for and different dimensions . We repeat the experiment times and collect the mean values shown in tables 1 and 2.
The plots in figure 2 show the convergence of the fixed-point equations for the estimation of and for the different versions of the algorithm. As one can see in both cases, the convergence is reached in all versions after approximately twenty iterations of the fixed-point loop. No significant differences arise.
Furthermore, we studied if the estimations were indeed accurate. We simulated different scenarios to study the error of estimation. We see in Table 1 that the error values are smaller in the case of our flexible and robust algorithm (FR). This can be explained by the robustness of the estimators.
We finally show the mean metrics comparing GMM, Student’s t EM and our algorithm. As seen in the Table 2, even in the Student’s case where the second algorithm is exact, our robust algorithm performs better in average.
|distribution||algorithm||mu error||sigma error|
3.2 Real data
We tested our clustering algorithm in three different real datasets: MNIST [MNIST], small NORB [NORB] and 20newsgroup [20newsgroup]. The MNIST hand-written digits (Figure 3) dataset has become a standard benchmark for classification methods. We apply our flexible and robust model (FR) to discover groups in balanced subsets of similar pairs of digits (3-8 and 1-7) and the set of digits (3-8-6). We additionally contaminated the later subset with a small portion of noise, we randomly added some of the remaining different digits.
As usual in most application examples in the literature, we applied PCA to work with some meaningful features instead of the original data. To make a trade-off between explained variance and curse of dimensionality effects we keptvariables. As can be seen in Tables 3 and 4, we obtain, in most cases, better values for both metrics than those produced by the other partitioning techniques. This can be explained by the increment in flexibility and the smaller impact of outliers in the estimation process. Comparing our algorithm to spectral clustering, ours usually obtain better results. The cases where the metrics are worse correspond to simpler scenarios without noise and with well separated clusters.
We collected the clustering results from the HDBSCAN algorithms fed with a grid of values for its two main parameters. All the computed metrics comparing the results with the ground truth were bad, close to 0. We show the best clustering result of the 3-8 MNIST subset in Figure 4
where a big number of data points is classified as noise by the algorithm. If the metric is computed only in the non-noise labeled data points then the clustering is almost perfect. This behaviour can be explained by the number of dimensions, that seems to be still big for HDBSCAN to deal with.
Additionally, we tried dimensional reduction techniques UMAP and t-SNE prior to the clustering task. All metrics were improved after a carefully tuning of parameters. In this scenario, our method performs similar to the classical GMM because these embedding methods tend to attract outliers and noise to clusters. However, these non-linear visualization approaches are not recommended to extract features before clustering because fictitious effects can appear depending on the parameters choice.
For the NORB dataset (some representatives are shown in Figure 5), k-means, GMM, spectral clustering and UMAP+HDBSCAN do not perform in a satisfactory way since they end-up capturing the luminosity as the main classification aspect. In contrast, -EM and the algorithm presented in this paper outperform them by far, as can be seen in Tables 3 and 4. This is evident from Figure 6, where label-colored two-dimensional embeddings of the data based on the classification produced by the different methods are shown. The effect of extreme light values seems to be faded by the robust parameter estimators.
We study the changes after centering each picture so that the mean is equal for all images. In this case, all methods have similar performance. Comparing all metrics from both cases, our metrics from the data without centering are the best.
Finally, the 20newsgroup dataset is roughly a bag of words constructed from a corpus of news classified by topic modelling into twenty groups. Once again, we compare the performance of our methods with that of k-means, EM and t-EM algorithms. The corresponding results are also presented in Tables 3 and 4. One can see that k-means and spectral clustering perform poorly, and we obtain remarkably better results also with respect to EM and t-EM.
In this paper we presented a robust clustering algorithm that models the data distribution as a mixture of compound-Gaussians. It is based on an approximation of the scale random variable as deterministic parameters. The flexibility of this model makes it particularly suited for analyzing heavy-tailed and/or noise-contaminated data.
More precisely, we introduced four somewhat similar versions of the algorithm that differ only on the iterative resolution of the fixed-point system of equations.
We applied our algorithm to cluster both synthetic and real data. The former are mainly Gaussian and t-distributed randomly generated data. For the latter we considered three well-known datasets, namely the so-called MNIST and NORB images sets, and also the 20newsgroups bag of words dataset. In all cases convergence was reached.
For the simulated data we obtained accurate estimations and good classifications rates. Of course, the best model is the one that coincides with the distribution of the data, e.g. when the mixture is actually Gaussian GMM outperforms all other methods, including ours.