1 Introduction
Classification aims to partition data into a set number of groups, whereby observations in the same group are in some sense similar to one another. Clustering is unsupervised classification, in that none of the group memberships are known a priori
. Most clustering algorithms originate from one of three major methods: hierarchical clustering,
means clustering, and mixture modelbased clustering. Although hierarchical and means clustering are still used, the mixture modelling approach has become increasingly popular due to its robustness and mathematical interpretability. In the mixture modelling framework for clustering, each component is usually taken to be a cluster. Although the model can employ almost any component distribution, Gaussian components remain popular due to the distribution’s versatility and ubiquity. Most mixture modelbased clustering methods assume, either explicitly or implicitly, that the data are free of outliers.Mixture modelbased clustering involves maximizing the likelihood of the mixture model. The density of a Gaussian mixture model is a convex linear combination of each component density, i.e.,
(1) 
where
(2) 
is the density of a
dimensional random variable
from a Gaussian distribution with mean and covariance matrix , is the mixing proportion such that , and .1.1 Accounting for outliers
Outliers, particularly those with high leverage, can significantly affect the parameter estimates. It is thus beneficial to remove, or reduce, the effect of outliers by accounting for them in the model. In modelbased clustering, we can incorporate outliers in several ways. The first method, proposed by Banfield and Raftery (1993), includes outliers in an additional uniform component over the convex hull. If outliers are clusterspecific, we can incorporate them into the tails if we cluster using mixtures of tdistributions (Peel and McLachlan, 2000). Punzo and McNicholas (2016) introduce mixtures of contaminated Gaussian distributions, where each cluster has a proportion of ‘good’ points with density , and a proportion of ‘bad’ points, with density
. Each distribution has the same centre, but the ‘bad’ points have an inflated variance, i.e.,
.Instead of fitting outliers in the model, it may be of interest to trim them from the dataset. CuestaAlbertos et al. (1997) developed an impartial trimming approach for means clustering; however, this method maintains the drawback of means clustering, where the clusters are spherical with equal—or similar, in practice—radii. GarcíaEscudero et al. (2008) improved upon trimmed
means with the TCLUST algorithm. TCLUST places a restriction on the eigenvalue ratio of the covariance matrix, as well as implementing a weight on the clusters, allowing for clusters of various elliptical shapes and sizes. An obvious challenge with these methods is that the proportion of outliers, denoted
, must be known a priori; furthermore, in TCLUST, the eigenvalue ratio must also be known. It is of great interest to bring trimming into the modelbased clustering domain, especially when is unknown, as is the case for most real datasets—in fact, for all but very low dimensional data.2 Distribution of LogLikelihoods
2.1 Distribution of Subset LogLikelihoods using Population Parameters
Consider a dataset in dimensional Euclidian space . Define the th subset as . Suppose each has Gaussian mixture model density as in (1). The loglikelihood of dataset under the Gaussian mixture model is
(3) 
Assumption 1.
The clusters are nonoverlapping and well separated.
Assumption 1 is required to simplify the model density to the component density, as shown in Lemma 1. In practice, however, these assumptions may be relaxed. For more information on the effect of cluster separation on the density, see Appendix A.
Write to indicate that belongs to the th cluster. Let , where if and if .
Lemma 1.
As the separation between the clusters increases, , i.e., the loglikelihood in (3) converges asymptotically to , where
(4) 
A proof of Lemma 1 may be found in Appendix B. We will maintain Assumption 1 throughout this paper. Using Lemma 1, an approximate loglikelihood for the mixture model is
(5) 
which can be regarded as the approximate loglikelihood for the entire dataset . Define as the approximate loglikelihood for the th subset .
Proposition 1.
If and , then , where , and
for and .
The requisite mathematical results are given in the following lemmata.
Lemma 2.
For ,
(6) 
where
Proof.
Population parameters and , , are impervious to the sample drawn from the dataset and remain constant for each subset , . Thus, the approximate loglikelihood for the th subset, , when is
(7) 
Rearranging (7) yields
(8) 
∎
Lemma 3.
This result is stated as Corollary 3.2.1.1 in Mardia et al. (1979).
Lemma 4.
.
Proof.
If , then . Thus,
by the scaling property of the gamma distribution. ∎
Let , , and . Then,
(9) 
for .
2.2 Distribution of Subset LogLikelihoods using Sample Parameter Estimates
Generally, population parameters and are unknown a priori. We can replace the population parameters with parameter estimates
(10) 
where is the number of observations in .
Assumption 2.
The number of observations in each cluster, , is large.
This is assumption required for the following lemmata.
Lemma 5.
Sample parameter estimates are asymptotically equal for all subsets:
(11) 
where and are the sample mean and sample covariance, respectively, for the th cluster considering all observations in the entire dataset , and and are the sample mean and sample covariance, respectively, for the th cluster considering only observations in the the th subset .
Proof.
If , then the equality trivially holds for all . For ,
(12) 
Thus as . Therefore, and so
(13) 
Thus as , so . ∎
Lemma 6.
Ververidis and Kotropoulos (2008) prove this result for all satisfying .
Proposition 2.
For , with and ,
(15) 
for .
Proof.
We will perform a change of variables. Let and . Then
The inverse function is
The absolute value of the derivative of with respect to is
Because is betadistributed, its density is
(16) 
for , , and . The transformation of variables allows the density of to be written
(17) 
The density of becomes
(18) 
for , , and . Thus, is betadistributed with
(19) 
∎
Because applies to any , with , let . Proposition 2 can be applied to generate the density of the mixture model with variable for any , i.e.,
(20) 
with .
Remark 1.
has density from (20) when typical model assumptions hold. If the density in (20) does not describe the distribution of subset loglikelihoods, i.e. they are not betadistributed, then we can conclude that at least one model assumption fails. In this case, we will assume that only the outlier assumption has been violated and that there are, in fact, outliers in the model.
3 OCLUST Algorithm
Let be the set of subset loglikelihoods generated from the data, i.e., is the realization of random variable . We propose testing the adherence of to the reference distribution in (20) as a way to test for the presence of outliers. In other words, if is not betadistributed, then outliers are present in the model. Because is asymptotically equal to , we will use . This is important because we will need
for outlier identification and, additionally, it is outputted by many existing clustering algorithms. The algorithm described below utilizes the loglikelihood and parameter estimates calculated using the expectationmaximization (EM) algorithm
(Dempster et al., 1977) for Gaussian modelbased clustering; however, other methods may be used to estimate parameters and the overall loglikelihood.OCLUST both identifies likely outliers and determines the proportion of outliers within the dataset. The OCLUST algorithm assumes all model assumptions hold, except that outliers are present. The algorithm involves removing points onebyone until the density in (20) describes the distribution of , which is determined using KullbackLeibler (KL) divergence, estimated via relative frequencies. Notably, KL divergence generally decreases as outliers are removed and the model improves. Once all outliers are removed, KL divergence increases again as points are removed from the tails. We select the number of outliers as the location of the global minimum. With each iteration, we remove the most likely outlier.
Definition 1 (Most likely outlier).
With each iteration, we define the most likely outlier as , where
(21) 
In other words, we assign the th point as outlying if the loglikelihood is greatest when point is removed. The OCLUST algorithm proceeds as follows:

Initialize the parameters:

Cluster the data into clusters using the EM algorithm, placing no restrictions on the covariance structure, and calculate the loglikelihood of the clustering solution, i.e., .

Calculate the sample covariance , the number of points , and the proportion of points for each cluster.


Calculate the KL divergence:

Create new datasets , each with one removed.

Cluster each of the datasets into clusters, placing no restrictions on the covariance structure, and calculating the loglikelihood for each solution.

Create a new set
where is the set of realized values for variable .


Determine the most likely outlier , as defined in Definition 1.

Update the values


Perform iterations of Steps 1–4 until an upper bound, , of desired outliers is obtained and the resulting KL divergence is calculated.

Choose the number of outliers as the value for which the KL divergence is minimized.
4 Simulation Study
The following simulation study tests the performance of OCLUST against the popular trimming method TCLUST. The datasets were generated to closely mimic those used in GarcíaEscudero et al. (2008) and, as such, the simulation scheme and notation used here are borrowed from GarcíaEscudero et al. (2008). Datasets containing three clusters with means , and , respectively, were generated with , and or . Covariance matrices were generated of the forms:
(22) 
With different combinations for , we generate five different models:

[label=.]

, spherical clusters with equal volumes;

, diagonal clusters with equal covariance matrices;

, clusters with equal volumes, but varying shapes and orientations;

, clusters with varying volumes, shapes, and orientations; and

, clusters with varying volumes, shapes, and orientations but two with severe overlap.
To fix the proportion of outliers to , each dataset had ‘regular’ observations and 100 outliers. Outliers were generated uniformly in the parallelotope defined by the coordinatewise maxima and minima of the ‘regular’ observations, accepting only those points with Mahalanobis squared distances greater than . Datasets either had equal cluster proportions () or unequal proportions (). Ten datasets were generated with each combination of parameters (dimension, cluster proportions, model).
TCLUST was run using the tclust (Fritz et al., 2012) package in R (R Core Team, 2018) with outlier proportion set to and the eigenvalue restriction . To compare OCLUST to TCLUST, outputs are compared when OCLUST is stopped at . The accuracy of each method is described in Table 1, where the accuracy is defined as the proportion of generated outliers labelled as outliers by the algorithm. OCLUST and TCLUST perform with similar accuracy, returning a difference of less than one outlier on average, in every case but in Model V with and equal cluster proportions. This might be due to the overlapping nature of the generated clusters. However, it may be partly a result of the decision to stop OCLUST at for comparison with TCLUST—for this dataset, OCLUST tended to find a higher proportion of outliers (see Table 2).
Cluster Proportions  Dimension  Model  OCLUST  TCLUST 

Equal  I  .973  .973  
()  II  .969  .968  
III  .971  .973  
IV  .962  .963  
V  .893  .927  
I  .970  .970  
II  .966  .969  
III  .973  .975  
IV  .952  .960  
V  .965  .966  
Unequal  I  .971  .971  
()  II  .963  .964  
III  .969  .970  
IV  .949  .957  
V  .920  .922  
I  .970  .971  
II  .961  .962  
III  .970  .968  
IV  .948  .958  
V  .953  .956 
To test OCLUST’s ability to predict the number of outliers, the algorithm was run with an upper bound . The prediction for the proportion of outliers returned by OCLUST is summarized in Table 2. It is important to note that must be correctly specified in order for TCLUST to perform well. Under or overspecification for results in misclassification error, as shown in app:alpha. Crucially, OCLUST predicts very well overall, with the true value for
always falling within one standard deviation of the mean (Table
2).Cluster Proportions  Dimension  Model  Mean  SD 

Equal  I  .1008  .00551  
()  II  .0987  .00353  
III  .0999  .00502  
IV  .1001  .00375  
V  .1022  .00757  
I  .1018  .00514  
II  .1016  .00254  
III  .0997  .00267  
IV  .1036  .00624  
V  .1010  .00525  
Unequal  I  .1000  .00455  
()  II  .0991  .00300  
III  .1010  .00450  
IV  .1028  .00505  
V  .1020  .00514  
I  .1003  .00371  
II  .1012  .00579  
III  .1023  .00371  
IV  .0985  .00502  
V  .0968  .00614 
5 Discussion
It was proved that, for data from a Gaussian mixture, the loglikelihoods of the subset models are betadistributed. This result was used to determine the number of outliers by removing outlying points until the subset loglikelihoods followed this derived distribution. The result is the OCLUST algorithm, which trims outliers from a dataset and predicts the proportion of outliers. In comparisons, OCLUST performs comparably to TCLUST when both approaches are given the correct proportion of outliers a priori. Crucially, however, OCLUST does not require a priori specification of the proportion of outliers, and its ability to estimate the proportion of outliers has been illustrated.
Although this work used the distribution of the loglikelihoods of the subset models to test for the presence of outliers, the derived distribution may be used to verify other underlying model assumptions, e.g., whether the clusters are Gaussian. Note that the OCLUST algorithm could be used with other clustering methods and should be effective so long as is it reasonable to assume that the underlying distribution of clusters is Gaussian. Of course, one could extend this work by deriving the distribution of subset loglikelihoods for mixture models with nonGaussian components.
Appendix A Relaxing Assumptions
Lemma 1 assumes that the clusters are well separated and nonoverlapping to simplify the model density to the component density. This section, however, serves to show that this assumption may be relaxed in practice. Following Qiu and Joe (2006), we can quantify the separation between clusters using the separation index ; in the univariate case,
(23) 
where is the sample lower quantile and is the sample upper quantile of cluster , and cluster 1 has lower mean than cluster 2. In the multivariate case, the separation index is calculated along the projected direction of maximum separation. Clusters with are separated, clusters with overlap, and clusters with are touching.
To measure the effect of separation index on the approximate loglikelihood , 100 random datasets with for each combination were generated using the clusterGeneration (Qiu and Joe, 2015) package in R. Data were created with three clusters with equal cluster proportions with dimensions and separation indices in . Covariance matrices were generated using random eigenvalues . The parameters were estimated using the mclust package (Scrucca et al., 2016). The loglikelihoods using the full and approximate densities were calculated using the parameter estimates. The average proportional change in loglikelihood over the 100 datasets between the full loglikelihood and the approximate loglikelihood , i.e., , is reported in Table 3. A graphical representation of the results is shown in fig:sepvslik.
Separation  Difference in LogLikelihoods ()  

Value  
0.9  1.13E+01  5.14E+00  3.03E+00 
0.8  1.03E+01  5.00E+00  2.99E+00 
0.7  9.70E+00  4.82E+00  2.89E+00 
0.6  9.15E+00  4.38E+00  2.67E+00 
0.5  7.74E+00  3.90E+00  2.35E+00 
0.4  6.46E+00  3.21E+00  2.00E+00 
0.3  4.86E+00  2.43E+00  1.44E+00 
0.2  3.25E+00  1.57E+00  9.76E01 
0.1  1.83E+00  8.86E01  5.54E01 
0  8.18E01  4.06E01  2.59E01 
0.1  2.58E01  1.32E01  8.51E02 
0.2  5.05E02  2.51E02  1.60E02 
0.3  4.24E03  2.29E03  1.24E03 
0.4  1.63E05  8.48E06  2.52E05 
0.5  3.94E10  6.49E11  2.85E10 
0.6  0*  0*  0* 
0.7  0*  0*  0* 
0.8  0*  0*  0* 
0.9  0*  0*  0* 
As one would expect, the approximation of by improves as the separation index increases (see Lemma 1). However, the difference is negligible for touching and separated clusters (). In the simulations in Section 4, the most overlapping clusters have , which produces an error of less than in two dimensions, and less than in six dimensions. In this case, the approximation is appropriate.
Appendix B Mathematical Results
b.1 Proof of Lemma 1
Proof.
Suppose is positive definite. Then, is also positive definite and there exists such that and is diagonal with . Let , where . Now,
because . Thus, as and
Suppose . Then, as the clusters separate, and for . Thus, for ,
(24) 
Thus,
(25) 
∎
Remark 2.
Although covariance matrices need only be positive semidefinite, we restrict to be positive definite so that is not degenerate.
Appendix C The Effect of on TCLUST Misclassification Error
TCLUST requires the proportion of outliers to be specified in advance. fig:mist shows the average misclassification error when TCLUST is performed on Model I with , over a range of values
. Overspecification results in too many points being classified as outliers, and underspecification results in outliers being classified as ‘regular’ data points. Error is minimized when
is specified correctly, but this is hard to do in practice, especially when is not small.Acknowledgements
This work was supported by an NSERC Undergraduate Research Award, the Canada Research Chairs program, and an E.W.R. Steacie Memorial Fellowship.
References
 Banfield and Raftery (1993) Banfield, J. D. and A. E. Raftery (1993). Modelbased Gaussian and nonGaussian clustering. Biometrics 49(3), 803–821.
 CuestaAlbertos et al. (1997) CuestaAlbertos, J. A., A. Gordaliza, and C. Matrán (1997, 04). Trimmed means: an attempt to robustify quantizers. The Annals of Statistics 25(2), 553–576.
 Dempster et al. (1977) Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39(1), 1–38.

Fritz et al. (2012)
Fritz, H., L. A. GarcíaEscudero, and A. MayoIscar (2012).
tclust: An R package for a trimming approach to cluster analysis.
Journal of Statistical Software 47(12), 1–26.  GarcíaEscudero et al. (2008) GarcíaEscudero, L. A., A. Gordaliza, C. Matrán, and A. MayoIscar (2008). A general trimming approach to robust cluster analysis. The Annals of Statistics 36(3), 1324–1345.

Gnanadesikan and
Kettenring (1972)
Gnanadesikan, R. and J. R. Kettenring (1972).
Robust estimates, residuals, and outlier detection with multiresponse data.
Biometrics 28(1), 81–124.  Mardia et al. (1979) Mardia, K. V., J. T. Kent, and J. M. Bibby (1979). Multivariate Analysis. London: Academic Press.
 Peel and McLachlan (2000) Peel, D. and G. J. McLachlan (2000). Robust mixture modelling using the t distribution. Statistics and Computing 10(4), 339–348.

Punzo and
McNicholas (2016)
Punzo, A. and P. D. McNicholas (2016, 11).
Parsimonious mixtures of multivariate contaminated normal distributions.
Biometrical Journal 58(6), 1506–1537.  Qiu and Joe (2006) Qiu, W. and H. Joe (2006). Separation index and partial membership for clustering. Computational Statistics & Data Analysis 50(3), 585 – 603.
 Qiu and Joe (2015) Qiu, W. and H. Joe (2015). clusterGeneration: Random Cluster Generation (with Specified Degree of Separation). R package version 1.3.4.
 R Core Team (2018) R Core Team (2018). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
 Scrucca et al. (2016) Scrucca, L., M. Fop, T. B. Murphy, and A. E. Raftery (2016). mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models. The R Journal 8(1), 205–233.
 Ververidis and Kotropoulos (2008) Ververidis, D. and C. Kotropoulos (2008, July). Gaussian mixture modeling by exploiting the mahalanobis distance. IEEE Transactions on Signal Processing 56(7), 2797–2811.
Comments
There are no comments yet.