 # Using Subset Log-Likelihoods to Trim Outliers in Gaussian Mixture Models

Mixtures of Gaussian distributions are a popular choice in model-based clustering. Outliers can affect parameters estimation and, as such, must be accounted for. Algorithms such as TCLUST discern the most likely outliers, but only when the proportion of outlying points is known a priori. It is proved that, for a finite Gaussian mixture model, the log-likelihoods of the subset models are beta-distributed. An algorithm is then proposed that predicts the proportion of outliers by measuring the adherence of a set of subset log-likelihoods to a beta reference distribution. This algorithm removes the least likely points, which are deemed outliers, until model assumptions are met.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Classification aims to partition data into a set number of groups, whereby observations in the same group are in some sense similar to one another. Clustering is unsupervised classification, in that none of the group memberships are known a priori

. Most clustering algorithms originate from one of three major methods: hierarchical clustering,

-means clustering, and mixture model-based clustering. Although hierarchical and -means clustering are still used, the mixture modelling approach has become increasingly popular due to its robustness and mathematical interpretability. In the mixture modelling framework for clustering, each component is usually taken to be a cluster. Although the model can employ almost any component distribution, Gaussian components remain popular due to the distribution’s versatility and ubiquity. Most mixture model-based clustering methods assume, either explicitly or implicitly, that the data are free of outliers.

Mixture model-based clustering involves maximizing the likelihood of the mixture model. The density of a Gaussian mixture model is a convex linear combination of each component density, i.e.,

 f(x∣ϑ)=G∑g=1πgϕ(x∣μg,Σg), (1)

where

 ϕ(x∣μg,Σg)=1√(2π)p|Σg|exp{−12(x−μg)′Σ−1g(x−μg)} (2)

is the density of a

-dimensional random variable

from a Gaussian distribution with mean and covariance matrix , is the mixing proportion such that , and .

### 1.1 Accounting for outliers

Outliers, particularly those with high leverage, can significantly affect the parameter estimates. It is thus beneficial to remove, or reduce, the effect of outliers by accounting for them in the model. In model-based clustering, we can incorporate outliers in several ways. The first method, proposed by Banfield and Raftery (1993), includes outliers in an additional uniform component over the convex hull. If outliers are cluster-specific, we can incorporate them into the tails if we cluster using mixtures of t-distributions (Peel and McLachlan, 2000). Punzo and McNicholas (2016) introduce mixtures of contaminated Gaussian distributions, where each cluster has a proportion of ‘good’ points with density , and a proportion of ‘bad’ points, with density

. Each distribution has the same centre, but the ‘bad’ points have an inflated variance, i.e.,

.

Instead of fitting outliers in the model, it may be of interest to trim them from the dataset. Cuesta-Albertos et al. (1997) developed an impartial trimming approach for -means clustering; however, this method maintains the drawback of -means clustering, where the clusters are spherical with equal—or similar, in practice—radii. García-Escudero et al. (2008) improved upon trimmed

-means with the TCLUST algorithm. TCLUST places a restriction on the eigenvalue ratio of the covariance matrix, as well as implementing a weight on the clusters, allowing for clusters of various elliptical shapes and sizes. An obvious challenge with these methods is that the proportion of outliers, denoted

, must be known a priori; furthermore, in TCLUST, the eigenvalue ratio must also be known. It is of great interest to bring trimming into the model-based clustering domain, especially when is unknown, as is the case for most real datasets—in fact, for all but very low dimensional data.

## 2 Distribution of Log-Likelihoods

### 2.1 Distribution of Subset Log-Likelihoods using Population Parameters

Consider a dataset in -dimensional Euclidian space . Define the th subset as . Suppose each has Gaussian mixture model density as in (1). The log-likelihood of dataset under the Gaussian mixture model is

 ℓX=n∑i=1log[G∑g=1πgϕ(xi∣μg,Σg)]. (3)
###### Assumption 1.

The clusters are non-overlapping and well separated.

Assumption 1 is required to simplify the model density to the component density, as shown in Lemma 1. In practice, however, these assumptions may be relaxed. For more information on the effect of cluster separation on the density, see Appendix A.

Write to indicate that belongs to the th cluster. Let , where if and if .

###### Lemma 1.

As the separation between the clusters increases, , i.e., the log-likelihood in (3) converges asymptotically to , where

 QX=n∑i=1G∑g=1ziglog[πgϕ(xi∣μg,Σg)]=∑xi∈Cglog[πgϕ(xi∣μg,Σg)]. (4)

A proof of Lemma 1 may be found in Appendix B. We will maintain Assumption 1 throughout this paper. Using Lemma 1, an approximate log-likelihood for the mixture model is

 QX=∑xi∈Cg[logπg+logϕ(xi∣μg,Σg)], (5)

which can be regarded as the approximate log-likelihood for the entire dataset . Define as the approximate log-likelihood for the th subset .

###### Proposition 1.

If and , then , where , and

 f\emphgamma(w∣k,θ)=1Γ(k)θkwk−1\emphexp{−w/θ},

for and .

The requisite mathematical results are given in the following lemmata.

###### Lemma 2.

For ,

 QX∖xj−QX=−logπh+p2log(2π)+12log\absΣh+12τj, (6)

where

 τj=(xj−μh)′Σ−1h(xj−μh).
###### Proof.

Population parameters and , , are impervious to the sample drawn from the dataset and remain constant for each subset , . Thus, the approximate log-likelihood for the th subset, , when is

 QX∖xj=QX−logπh−logϕ(xj∣μh,Σh). (7)

Rearranging (7) yields

 QX∖xj−QX=−logπh+p2log(2π)+12log\absΣh+12τj. (8)

###### Lemma 3.

This result is stated as Corollary 3.2.1.1 in Mardia et al. (1979).

.

###### Proof.

If , then . Thus,

by the scaling property of the gamma distribution. ∎

Let , , and . Then,

 Yj∼fgamma(yj−c∣p/2,1), (9)

for .

### 2.2 Distribution of Subset Log-Likelihoods using Sample Parameter Estimates

Generally, population parameters and are unknown a priori. We can replace the population parameters with parameter estimates

 (10)

where is the number of observations in .

###### Assumption 2.

The number of observations in each cluster, , is large.

This is assumption required for the following lemmata.

###### Lemma 5.

Sample parameter estimates are asymptotically equal for all subsets:

 ¯xg∖j≃¯xg,Sg∖j≃Sg, (11)

where and are the sample mean and sample covariance, respectively, for the th cluster considering all observations in the entire dataset , and and are the sample mean and sample covariance, respectively, for the th cluster considering only observations in the the th subset .

###### Proof.

If , then the equality trivially holds for all . For ,

 ¯xh∖j=nh¯xh−xjnh−1. (12)

Thus as . Therefore, and so

 Sh∖j≃(nh−1)Sk−(xj−¯xh)(xj−¯xh)′nh−2. (13)

Thus as , so . ∎

Using the sample parameter estimates, (8) becomes

 QX∖xj−QX=−logπh+p2log(2π)+12log\absSh+12tj, (14)

where .

###### Lemma 6.

(From Gnanadesikan and Kettenring, 1972) When ,

 n(n−1)2Tj∼f\emphbeta(n(n−1)2tj ∣∣∣ p2,n−p−12),

for .

Ververidis and Kotropoulos (2008) prove this result for all satisfying .

For , with and ,

 (15)

for .

###### Proof.

We will perform a change of variables. Let and . Then

 Yj=(nh−1)22nhXj+c.

The inverse function is

 xj=v(yj)=2nh(nh−1)2(yj−c).

The absolute value of the derivative of with respect to is

 \absdxjdyj=\abs2nh(nh−1)2=2nh(nh−1)2.

Because is beta-distributed, its density is

 fbeta(xj | α,β)=Γ(α+β)Γ(α)Γ(β)xα−1j(1−xj)β−1, (16)

for , , and . The transformation of variables allows the density of to be written

 fY(yj)=fX(v(yj))\absdxjdyj. (17)

The density of becomes

 fY(yj)=2nh(nh−1)2Γ(α+β)Γ(α)Γ(β)[2nh(nh−1)2(yj−c)]α−1[1−2nh(nh−1)2(yj−c)]β−1, (18)

for , , and . Thus, is beta-distributed with

 (19)

Because applies to any , with , let . Proposition 2 can be applied to generate the density of the mixture model with variable for any , i.e.,

 f(y∣ϑ)=G∑g=1πgfg(y∣θg), (20)

with .

###### Remark 1.

has density from (20) when typical model assumptions hold. If the density in (20) does not describe the distribution of subset log-likelihoods, i.e. they are not beta-distributed, then we can conclude that at least one model assumption fails. In this case, we will assume that only the outlier assumption has been violated and that there are, in fact, outliers in the model.

## 3 OCLUST Algorithm

Let be the set of subset log-likelihoods generated from the data, i.e., is the realization of random variable . We propose testing the adherence of to the reference distribution in (20) as a way to test for the presence of outliers. In other words, if is not beta-distributed, then outliers are present in the model. Because is asymptotically equal to , we will use . This is important because we will need

for outlier identification and, additionally, it is outputted by many existing clustering algorithms. The algorithm described below utilizes the log-likelihood and parameter estimates calculated using the expectation-maximization (EM) algorithm

(Dempster et al., 1977) for Gaussian model-based clustering; however, other methods may be used to estimate parameters and the overall log-likelihood.

OCLUST both identifies likely outliers and determines the proportion of outliers within the dataset. The OCLUST algorithm assumes all model assumptions hold, except that outliers are present. The algorithm involves removing points one-by-one until the density in (20) describes the distribution of , which is determined using Kullback-Leibler (KL) divergence, estimated via relative frequencies. Notably, KL divergence generally decreases as outliers are removed and the model improves. Once all outliers are removed, KL divergence increases again as points are removed from the tails. We select the number of outliers as the location of the global minimum. With each iteration, we remove the most likely outlier.

###### Definition 1 (Most likely outlier).

With each iteration, we define the most likely outlier as , where

 k=argmaxj∈[1,n]ℓX∖xj. (21)

In other words, we assign the th point as outlying if the log-likelihood is greatest when point is removed. The OCLUST algorithm proceeds as follows:

1. Initialize the parameters:

1. Cluster the data into clusters using the EM algorithm, placing no restrictions on the covariance structure, and calculate the log-likelihood of the clustering solution, i.e., .

2. Calculate the sample covariance , the number of points , and the proportion of points for each cluster.

2. Calculate the KL divergence:

1. Create new datasets , each with one removed.

2. Cluster each of the datasets into clusters, placing no restrictions on the covariance structure, and calculating the log-likelihood for each solution.

3. Create a new set

 Y={ℓX∖xj−ℓX}j=1:n,

where is the set of realized values for variable .

4. Generate the density of using (20) and the parameters from Step 1b. Calculate the approximate KL divergence of to the generated density, using relative frequencies.

3. Determine the most likely outlier , as defined in Definition 1.

4. Update the values

5. Perform iterations of Steps 1–4 until an upper bound, , of desired outliers is obtained and the resulting KL divergence is calculated.

6. Choose the number of outliers as the value for which the KL divergence is minimized.

## 4 Simulation Study

The following simulation study tests the performance of OCLUST against the popular trimming method TCLUST. The datasets were generated to closely mimic those used in García-Escudero et al. (2008) and, as such, the simulation scheme and notation used here are borrowed from García-Escudero et al. (2008). Datasets containing three clusters with means , and , respectively, were generated with , and or . Covariance matrices were generated of the forms:

 Σ1=diag(1,a,1,…,1),Σ2=diag(b,c,1,…,1),Σ3=⎛⎜ ⎜⎝deef00I⎞⎟ ⎟⎠. (22)

With different combinations for , we generate five different models:

1. [label=.]

2. , spherical clusters with equal volumes;

3. , diagonal clusters with equal covariance matrices;

4. , clusters with equal volumes, but varying shapes and orientations;

5. , clusters with varying volumes, shapes, and orientations; and

6. , clusters with varying volumes, shapes, and orientations but two with severe overlap.

To fix the proportion of outliers to , each dataset had ‘regular’ observations and 100 outliers. Outliers were generated uniformly in the -parallelotope defined by the coordinate-wise maxima and minima of the ‘regular’ observations, accepting only those points with Mahalanobis squared distances greater than . Datasets either had equal cluster proportions () or unequal proportions (). Ten datasets were generated with each combination of parameters (dimension, cluster proportions, model).

TCLUST was run using the tclust (Fritz et al., 2012) package in R (R Core Team, 2018) with outlier proportion set to and the eigenvalue restriction . To compare OCLUST to TCLUST, outputs are compared when OCLUST is stopped at . The accuracy of each method is described in Table 1, where the accuracy is defined as the proportion of generated outliers labelled as outliers by the algorithm. OCLUST and TCLUST perform with similar accuracy, returning a difference of less than one outlier on average, in every case but in Model V with and equal cluster proportions. This might be due to the overlapping nature of the generated clusters. However, it may be partly a result of the decision to stop OCLUST at for comparison with TCLUST—for this dataset, OCLUST tended to find a higher proportion of outliers (see Table 2).

To test OCLUST’s ability to predict the number of outliers, the algorithm was run with an upper bound . The prediction for the proportion of outliers returned by OCLUST is summarized in Table 2. It is important to note that must be correctly specified in order for TCLUST to perform well. Under- or over-specification for results in misclassification error, as shown in app:alpha. Crucially, OCLUST predicts very well overall, with the true value for

always falling within one standard deviation of the mean (Table

2).

## 5 Discussion

It was proved that, for data from a Gaussian mixture, the log-likelihoods of the subset models are beta-distributed. This result was used to determine the number of outliers by removing outlying points until the subset log-likelihoods followed this derived distribution. The result is the OCLUST algorithm, which trims outliers from a dataset and predicts the proportion of outliers. In comparisons, OCLUST performs comparably to TCLUST when both approaches are given the correct proportion of outliers a priori. Crucially, however, OCLUST does not require a priori specification of the proportion of outliers, and its ability to estimate the proportion of outliers has been illustrated.

Although this work used the distribution of the log-likelihoods of the subset models to test for the presence of outliers, the derived distribution may be used to verify other underlying model assumptions, e.g., whether the clusters are Gaussian. Note that the OCLUST algorithm could be used with other clustering methods and should be effective so long as is it reasonable to assume that the underlying distribution of clusters is Gaussian. Of course, one could extend this work by deriving the distribution of subset log-likelihoods for mixture models with non-Gaussian components.

## Appendix A Relaxing Assumptions

Lemma 1 assumes that the clusters are well separated and non-overlapping to simplify the model density to the component density. This section, however, serves to show that this assumption may be relaxed in practice. Following Qiu and Joe (2006), we can quantify the separation between clusters using the separation index ; in the univariate case,

 J∗=L2(α/2)−U1(α/2)U2(α/2)−L1(α/2), (23)

where is the sample lower quantile and is the sample upper quantile of cluster , and cluster 1 has lower mean than cluster 2. In the multivariate case, the separation index is calculated along the projected direction of maximum separation. Clusters with are separated, clusters with overlap, and clusters with are touching.

To measure the effect of separation index on the approximate log-likelihood , 100 random datasets with for each combination were generated using the clusterGeneration (Qiu and Joe, 2015) package in R. Data were created with three clusters with equal cluster proportions with dimensions and separation indices in . Covariance matrices were generated using random eigenvalues . The parameters were estimated using the mclust package (Scrucca et al., 2016). The log-likelihoods using the full and approximate densities were calculated using the parameter estimates. The average proportional change in log-likelihood over the 100 datasets between the full log-likelihood and the approximate log-likelihood , i.e., , is reported in Table 3. A graphical representation of the results is shown in fig:sepvslik. Figure 1: Graphical representation of the results in Table 3, showing the effect of cluster separation on the approximate log-likelihood of the model, where the vertical line represents the threshold between separated and overlapping clusters.

As one would expect, the approximation of by improves as the separation index increases (see Lemma 1). However, the difference is negligible for touching and separated clusters (). In the simulations in Section 4, the most overlapping clusters have , which produces an error of less than in two dimensions, and less than in six dimensions. In this case, the approximation is appropriate.

## Appendix B Mathematical Results

### b.1 Proof of Lemma 1

###### Proof.

Suppose is positive definite. Then, is also positive definite and there exists such that and is diagonal with . Let , where . Now,

 (x−μ)′Σ−1(x−μ)=w′QΣ−1Q′w=w′Λw=p∑i=1λiw2i≥infi(λi)p∑i=1w2i=infi(λi)∥w∥2=infi(λi)∥x−μ∥2

because . Thus, as and

 ϕ(x∣μ,Σ)=1√(2π)p|Σ|exp{−12(x−μ)′Σ−1(x−μ)}→0.

Suppose . Then, as the clusters separate, and for . Thus, for ,

 G∑g=1πgϕ(xi∣μg,Σg)=∑g≠hπgϕ(xi∣μg,Σg)+πhϕ(xi∣μh,Σh)≃πhϕ(xi∣μh,Σh). (24)

Thus,

 (25)

###### Remark 2.

Although covariance matrices need only be positive semi-definite, we restrict to be positive definite so that is not degenerate.

## Appendix C The Effect of α on TCLUST Misclassification Error

TCLUST requires the proportion of outliers to be specified in advance. fig:mist shows the average misclassification error when TCLUST is performed on Model I with , over a range of values

. Over-specification results in too many points being classified as outliers, and underspecification results in outliers being classified as ‘regular’ data points. Error is minimized when

is specified correctly, but this is hard to do in practice, especially when is not small. Figure 2: The effect of α specification on misclassification error in TCLUST for the simulation in Appendix C.

### Acknowledgements

This work was supported by an NSERC Undergraduate Research Award, the Canada Research Chairs program, and an E.W.R. Steacie Memorial Fellowship.

## References

• Banfield and Raftery (1993) Banfield, J. D. and A. E. Raftery (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3), 803–821.
• Cuesta-Albertos et al. (1997) Cuesta-Albertos, J. A., A. Gordaliza, and C. Matrán (1997, 04). Trimmed -means: an attempt to robustify quantizers. The Annals of Statistics 25(2), 553–576.
• Dempster et al. (1977) Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39(1), 1–38.
• Fritz et al. (2012) Fritz, H., L. A. García-Escudero, and A. Mayo-Iscar (2012).

tclust: An R package for a trimming approach to cluster analysis.

Journal of Statistical Software 47(12), 1–26.
• García-Escudero et al. (2008) García-Escudero, L. A., A. Gordaliza, C. Matrán, and A. Mayo-Iscar (2008). A general trimming approach to robust cluster analysis. The Annals of Statistics 36(3), 1324–1345.
• Gnanadesikan and Kettenring (1972) Gnanadesikan, R. and J. R. Kettenring (1972).

Robust estimates, residuals, and outlier detection with multiresponse data.

Biometrics 28(1), 81–124.
• Mardia et al. (1979) Mardia, K. V., J. T. Kent, and J. M. Bibby (1979). London: Academic Press.
• Peel and McLachlan (2000) Peel, D. and G. J. McLachlan (2000). Robust mixture modelling using the t distribution. Statistics and Computing 10(4), 339–348.
• Punzo and McNicholas (2016) Punzo, A. and P. D. McNicholas (2016, 11).

Parsimonious mixtures of multivariate contaminated normal distributions.

Biometrical Journal 58(6), 1506–1537.
• Qiu and Joe (2006) Qiu, W. and H. Joe (2006). Separation index and partial membership for clustering. Computational Statistics & Data Analysis 50(3), 585 – 603.
• Qiu and Joe (2015) Qiu, W. and H. Joe (2015). clusterGeneration: Random Cluster Generation (with Specified Degree of Separation). R package version 1.3.4.
• R Core Team (2018) R Core Team (2018). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
• Scrucca et al. (2016) Scrucca, L., M. Fop, T. B. Murphy, and A. E. Raftery (2016). mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models. The R Journal 8(1), 205–233.
• Ververidis and Kotropoulos (2008) Ververidis, D. and C. Kotropoulos (2008, July). Gaussian mixture modeling by exploiting the mahalanobis distance. IEEE Transactions on Signal Processing 56(7), 2797–2811.