1 Introduction
Modelbased clustering assumes that the observed data are generated from a mixture of
components, each representing the probability distribution for a different group or cluster
(McLachlan and Peel, 2000). The general form of a finite mixture model is , where the ’s are the mixing probabilities such that and , and are, respectively, the density and the parameters of the th component (). With continuous data, we often take the density for each mixture component to be the multivariate Gaussian with parameters . Thus, clusters are ellipsoidal, centered at the means , and with other geometric features, such as volume, shape and orientation, determined by .Parsimonious parametrization of covariance matrices can be adopted through eigenvalue decomposition in the form , where is a scalar controlling the volume of the ellipsoid, is a diagonal matrix specifying the shape of the density contours, and
is an orthogonal matrix which determines the orientation of the corresponding ellipsoid
(Banfield and Raftery, 1993; Celeux and Govaert, 1995). Fraley and Raftery (2006, Table 1) report some parametrizations of withingroup covariance matrices available in the MCLUST software, and the corresponding geometric characteristics. Maximum likelihood estimates for this type of mixture models can be computed via the EM algorithm (Dempster et al, 1977; Fraley and Raftery, 2002), while model selection could be based on the Bayesian information criterion (BIC) (Fraley and Raftery, 1998) or the integrated completedata likelihood (ICL) criterion (Biernacki et al, 2000).In this paper we propose a dimension reduction method for modelbased clustering, where we pursue the estimation of a reduced subspace which captures most of the clustering structure contained in the data. Following the work of Li (1991, 2000) on Sliced Inverse Regression (SIR), information on the dimension reduction subspace is obtained from the variation on group means and, depending on the estimated mixture model, on the variation on group covariances. The estimated directions, ordered by importance as quantified by the associated eigenvalues, span a dimension reduction subspace. Observations may then be plotted on such a subspace, hence providing summary plots which visualize the clustering structure, and may help to understand the clustering and improve accuracy.
In Section 2 we introduce the problem of clustering on a dimension reduced subspace, and we show that assignment of an observation to a given cluster is unchanged if performed on a suitable subspace. Then we discuss estimation of the directions which span the reduced subspace, and we provide some of their properties. In Section 3 we review and adapt a greedy search procedure for selecting a subset of the new constructed variables which retains most of the clustering information contained in the data. In Section 4 we describe a data analysis example based on a synthetic data set; then we discuss results from simulation studies where we compare the proposed approach to modelbased clustering performed on both the original variables and the leading principal components. Section 5 presents applications of the proposed procedure on some real data sets. The final section contains some concluding remarks.
2 Dimension reduction for modelbased clustering
Classical procedures for dimensionality reduction are principal components analysis and factor analysis, both of which reduce dimensionality by forming linear combinations of the features. The first method seeks a lowerdimensional representation that account for the most variance of the features, while the second looks for the most correlation among the features. However, neither method correctly addresses the problem of visualizing any potential clustering structure.
Chang (1983) discussed a simulated example from a 15dimensional mixture model to show the failure of principal components as a method for reducing the dimension of the data before clustering. Let , where , , with mean and covariance matrix , , where the first 8 elements of are and the last 7 are . With this scheme the first 8 variables can be considered roughly as a block of variables with the same correlations, while the rest of the variables form another block. Using this scheme we simulated data points. Figure 1 shows the scatterplot matrix of the first, second and last principal components, and the first GMMDR variable (to be discussed on Section 2.2) obtained from a two components EEE mixture model. As it can be seen, the first two PCs are not able to show any clustering information, but, as discussed by Chang (1983), the first and the last PCs are needed. On the contrary, only one GMMDR variable is required to clearly separate the clusters. Furthermore, the coefficients (under unit norm constraint) of the linear combination defining such a direction reproduce the blocking structure used to simulate the variables, with the first 8 variables having coefficients approximately equal to and the remaining 7 coefficients approximately equal to .
Tipping and Bishop (1999b)
proposed a probabilistic principal component analysis based on a factor analysis model for the data assuming that the distribution of the errors are isotropic, i.e. the covariance matrix is proportional to the identity matrix. They also developed a mixture of probabilistic principal components analyzers model
(Tipping and Bishop, 1999a), and a hierarchical visualization algorithm for exploring clusters (Bishop and Tipping, 1998). McLachlan and Peel (2000) and McLachlan et al (2003)used mixtures of factor analyzers, by postulating that the distribution of the data within any group or latent class follows a factor analysis model. Parsimonious Gaussian mixture models based on mixtures of factor analyzers were discussed by
McNicholas and Murphy (2008). In order to deal with high dimensional data, Bouveyron et al (2007) proposed a family of parsimonious GMMs on clustering subspaces. All such models mainly address the problem of reducing the number of parameters to be estimated by a suitable reparametrization of the covariance matrix. Since they produce estimates of (latent) factor analyzers, they can also be used to reduce dimensionality. However, as noted by McLachlan and Peel (2000, p. 248), the resulting plots are not always useful in showing clustering structure. To support such claim, they presented an example using a simple two clusters synthetic data set where the differences between the group means are only in one variable, and with withingroup variation in the other variables being relatively large. They showed that both principal components and factor analyzers were unable to show the underlying clustering structure (see McLachlan and Peel, 2000, p. 240, 254–255). We applied the GMMDR method proposed in this paper, and the data projected onto the subspace spanned by the first two directions are shown in Figure 2: the two clusters appear well separated with one group showing less variation than the other.2.1 Clustering on a dimension reduced subspace
Consider the vector
of random variables, and the discrete random variable
taking distinct values to indicate the clusters. Let denote a fixed matrix, where , such that . This conditional independence statement says that the distribution of is the same as that of for all values of in its marginal sample space. As a consequence, the vector can be replaced by the vector without loss of clustering information. If , we have effectively reduced the dimension of the predictor vector. The matrix provides a basis for the subspace , which in a regression context is called a dimensionreduction subspace (DRS) for the regression of on (Li, 1991). It can be shown that a minimum DRS may not be unique, but when several of such subspaces exist they all have the same dimension. Let denote the intersection of all DRSs. Under various reasonable conditions given in Cook (1998, Chap. 6), is a unique minimum DRS and it is called central DRS.The assumption implies that , so we may write the density for the th component of the mixture model as
(1) 
and for any two groups, and , the ratio of the densities is
Thus, if the ratio of the conditional densities is the same whether it is computed on the original variables space or on the DRS, so the clustering information is completely contained in
. For instance, consider the simple case of two clusters with bivariate Gaussian distribution
for , where , , and is the identity matrix. The ratio of the densities is thus given byIt is straightforward to see that the direction containing all the clustering information is , so the ratio of the densities projected along such direction is
In modelbased clustering an observation is assigned to cluster by the maximum a posteriori principle (MAP), i.e. to the cluster for which the conditional probability given the data is a maximum:
By equation (1), the last expression is equivalent to
so the assignment of an observation to a cluster is unchanged if performed on the DRS. Cook and Yin (2001, p. 156) considered and defined a discriminant subspace as a subspace such that for all values of , where denotes the projection operator onto with respect to the usual inner product. The intersection of all discriminant subspaces, if it exists, is itself a discriminant subspace, and is called the central discriminant subspace (CDS) and denoted by . Finally, the CDS is a subset of the central DRS, , and in some cases may be a proper subset.
In the case of known group memberships, a variety of methods for dimensionality reduction have been proposed (Cook and Yin, 2001). These methods, such as Sliced Inverse Regression (SIR) and Sliced Average Variance Estimation (SAVE), are not restricted to any particular classification method. In the following subsection we discuss estimation of a dimension reduction subspace for a modelbased clustering procedure.
2.2 Estimation of GMMDR directions
Suppose we describe a set of observations on variables through a components Gaussian mixture model (GMM) of the form .
From a graphical point of view, we pursue the smallest subspace that can capture the clustering information contained in the data. A natural starting point would require to look for those directions where the cluster means vary as much as possible, provided each direction is orthogonal to the others. This amounts to solving the following optimization problem:
where is the betweencluster covariance matrix, is the covariance matrix with , is the spanning matrix, and is the identity matrix. The solution to this constrained optmization is given by the eigendecomposition of the kernel matrix with respect to
. The eigenvectors, corresponding to the first
largest eigenvalues, , provide a basis for the subspace which shows the maximal variation among cluster means. There are at most directions which span this subspace.This procedure is similar to the sliced inverse regression (SIR) algorithm (Li, 1991). SIR is a dimension reduction method which exploits the information from the inverse mean function (see Li, 2000, Chap. 2)
. Instead of conditioning on the response variable (or a sliced version of it if continuous), here we condition on the cluster memberships. SIR directions span at least part of the dimension reduction subspace
(Cook, 1998, Chap. 6) and they provide an intuitive and useful basis for constructing summary plots. However, they may miss relevant structures in the data when withincluster covariances are different.The SIR method (Li, 2000, Chap. 5) exploits the information coming from the differences in the class covariance matrices. The kernel matrix is now defined as:
where is the pooled withincluster covariance matrix, and directions are found through the eigendecomposition of with respect to . Albeit the corresponding directions show differences in group covariances, they are usually not able to also show location differences.
The proposed approach aims at finding those directions which, depending on the selected Gaussian mixture model, are able to display both variation in cluster means and variation in cluster covariances.
Definition 1.
Consider the following kernel matrix
(2) 
The basis of the dimension reduction subspace is the solution of the following constrained optimization:
This is solved through the generalized eigendecomposition
(3) 
Thus, the basis of the required subspace is given by the GMMDR directions .
The kernel matrix (2) contains information from variation on both cluster means and cluster covariances. For mixture models which assume constant withincluster covariance matrices, such as models E, EII, EEI, and EEE (see Fraley and Raftery, 2006, Table 1), the subspace spanned by is equivalent to that spanned by .
Remark 1.
Let be the subspace spanned by the matrix of eigenvectors from (3), and and be the means and covariance, respectively, for the th cluster. It can be easily seen that the projection of the parameters onto the subspace are given by and . Furthermore, the projection of the data matrix onto can be computed as ; we call these GMMDR variables.
Remark 2.
The raw coefficients of the estimated directions are uniquely determined up to multiplication by a scalar, whereas the associated directions from the origin are unique. Hence, we can adjust their length such that they have unit length, i.e. each direction is normalized as for . In matrix form, let , where is the matrix of eigenvectors from (3), then . For the GMMDR variables
Thus, they are uncorrelated, and with variances inversely related to the squared length of the eigenvectors of the kernel matrix, while GMMDR directions are orthogonal with respect to inner product.
We now provide some properties as propositions whose proofs are reported in the Appendix A.1.
Proposition 1.
GMMDR directions are invariant under affine transformation , for a nonsingular matrix and a vector of real values. Thus, GMMDR directions in the transformed scale are given by .
Proposition 2.
Each eigenvalue of the eigendecomposition in (3) can be decomposed in the sum of the contributions given by the squared variance of the between group means and the average of the squared within group variances along the corresponding direction of the projection subspace, i.e.
Based on this result we may provide an interpretation to the contribution of each direction to the visualization of the clustering structure. Furthermore, directions corresponding to small eigenvalues provide little or no information about differences in cluster means or covariances.
Remark 3.
Let be the sample data matrix which we assume, with no loss of generality, to have zero mean columnvectors. The sample version of (2) is obtained using the corresponding estimates from the fit of a Gaussian mixture model via the EM algorithm. Then, GMMDR directions are calculated from the generalized eigendecomposition of with respect to .
Remark 4.
The dimension of the estimated subspace is for models which assume equal withincluster covariance matrices (i.e. models E, EII, EEI, and EEE), while otherwise. The GMMDR directions are ordered based on the associated eigenvalues, so those directions corresponding to approximately zero eigenvalues can be discarded in practice since clusters largely overlap along such directions. A more formal approach to the selection of relevant directions is discussed in Section 3.
3 Subset selection of GMMDR variables
In the previous section we have discussed the estimation of GMMDR variables, which is basically a linear mapping method onto a suitable subspace. This can be viewed as a form of (soft) feature extraction, where the components are reduced through a set of linear combinations of the original variables. However, it may be possible that a subset of the estimated GMMDR variables provide either no clustering information or redundant clustering information already provided by other variables.
Consider the partitioned matrix , which forms a basis for . Based on proposition 3 of Cook and Yin (2000, p.157), if and then is a DRS. This is useful since, albeit it does not indicate how to construct , it indicates that unnecessary variables are independent from relevant variables within groups, and that they are also marginally independent from . In our case, given that unnecessary GMMDR variables provide no clustering information but require parameters estimation, we would like to detect and discard them.
In the following we consider a subset selection method for GMMDR variables following the approach of Raftery and Dean (2006)
, who proposed the use of BIC to approximate Bayes factors for comparing mixture models fitted on nested subsets of variables. A generalization of this approach has been recently discussed by
Maugis et al (2009).3.1 A criterion for feature selection
Let be the set of indices with indexing a subset of features from the original GMMDR variables (), and be the set of obtained excluding the th feature index from . The comparison of the two nested subsets can be recast as a model comparison problem and addressed using BIC difference as an approximation to Bayes factor. Following Raftery and Dean (2006), the BIC difference for the inclusion of the th feature is given by
(4) 
where is the BIC value for the “best” model fitted using features in , and is the BIC value for no clustering. The latter can be written as
i.e. the BIC value for the “best” model fitted using features in plus the BIC value for the regression of the th feature on the remaining features in . Noting that GMMDR variables are orthogonal, the general formula for (see Raftery and Dean, 2006, eq. (7)) reduces to , where is the maximum likelihood estimate of the th feature variance. In all cases the “best” clustering model is identified with respect to both the number of mixture components and model parametrization.
3.2 A greedy search algorithm for selecting the “best” subset
GMMDR variables are defined as orthogonal linear combinations of the original variables, and it is likely that only a subset of them is needed for clustering. However, the space of all possible subsets of size , with ranging from 1 to , has number of elements equal to . An exhaustive search soon becomes unfeasible, even for moderate values of . To alleviate this problem, Raftery and Dean (2006) proposed a greedy search algorithm for finding a local optimum in the model space. The search has both a forward step which evaluates inclusion of a proposed variable, and a backward step which evaluates exclusion of one of the currently included variables. The algorithm stops when consecutive inclusion and exclusion steps are rejected. Since GMMDR variables are orthogonal the search can be simplified by skipping the backward step. This allows us to considerably speed up the computing time.
The search starts by selecting the single GMMDR variable which maximizes the BIC criterion. This is equivalent to using the statistic in equation (4), with and for any . Thus, at the first step equation (4) measures the difference between the best clustering model and the single cluster model for each variable. In the following steps, we select the GMMDR variate which maximizes the BIC difference (4), so taking into account the features already included. We keep adding GMMDR variates until the BIC difference becomes negative. A detailed description of the greedy search algorithm is reported in Appendix A.2.
At each step, the search over the model space is usually performed with respect to both the model parametrization and the number of clusters. However, we may want to fix the model parameters at the estimated values obtained using the full set of features; in such a case, the order in which the GMMDR directions are selected follows the ordering of the associated eigenvalues.
3.3 An iterative procedure for feature selection
The greedy search discussed in the previous section is likely to discard some of the GMMDR variates, in particular those associated with small eigenvalues which do not carry any clustering information once other features have been included. A GMM may then be fitted on such constructed variables and the corresponding GMMDR directions estimated. The feature selection step can then be repeated, and the whole process iterated until no directions could be dropped. Depending on the number of initial features, the number of iterations required is usually small. The whole process is described in the following algorithm:
Algorithm: GMMDR estimation and feature selection process Fit a GMM on the original data. Estimate GMMDR directions using the method in Section 2.2 and project the data onto the estimated subspace. Perform feature selection using the greedy search discussed in Section 3.2. Fit a GMM on the selected GMMDR variables and go to step 2. Repeat steps 2–4 until none of the features could be dropped.4 Simulations
In this section we present a data analysis example based on a synthetic data set to describe the proposed approach, followed by some simulation studies. The adjusted Rand index (ARI), proposed by Hubert and Arabie (1985), is adopted for evaluating the clustering obtained and comparing the results arising from competing mixture models. The ARI gives a measure of agreement between two partitions, one estimated by a statistical procedure independent of the labelling of the groups, and one being the true classification. This index has zero expected value in the case of random partition, and it is bounded above by 1. Thus, higher values represent a better performance.
4.1 Synthetic data on three overlapping clusters with different shapes
Consider a simulated data set with three overlapping clusters on ten variables, with each cluster sample size equal to (
). The first three variables have been generated from a multivariate normal distribution with means
, , , and withingroups covariancesSeven noise variables, generated independently from a standard normal distribution, were also added. Thus, clustering information is only contained in the first three variables, with the remaining ones been simply noise. The correct model we hope to find is the VVV model with .
The GMM with the largest BIC is the seven clusters EEI model, closely followed by the model with six clusters. Both models have the wrong number of components, thus yielding a poor classification accuracy. Constraining the selected model improves the ARI, albeit there are still some misclassified data points (see Table 1). Using the “best” unconstrained GMM we obtained the GMMDR directions; there are of such directions, but only the first three are retained by the feature selection step. As it can be seen from Table 1, the final VVV mixture model with 3 clusters fitted on the three selected GMMDR variables is able to recover the original clustering structure.
Method  Model  BIC  ARI  

GMM (unconstrained)  EEI  7  4731.56  0.6674 
GMM (constrained)  VII  3  4820.57  0.8671 
GMMDR (3 directions)  VVV  3  1339.40  1.0000 
The left panel of Figure 3 shows the eigenvalues associated with the GMMDR directions. From such a plot we can see that the first two directions are the most important, with the first showing both difference in means and variances among clusters, while the second almost only differences in means. The last direction adds a small amount of clustering information, mostly from differences in variances. The coefficients for the selected GMMDR directions (see the right panel of Figure 3) indicate that only the first three variables seem to be required, since the coefficients for the remaining ones are almost zero.
Graphs contained in Figure 4 use GMMDR directions to convey clustering information about the fitted GMM. Plots along the diagonal show univariate component densities, while plots above the diagonal report contours of the estimated densities for each component. Such graphs clearly reflect the characteristics of the mixture model fitted; in particular, clusters have different volume, shape and orientation. Furthermore, it is quite clear that clusters on the projection subspace appear to be well separated along the first two directions, as it can be seen from the maximum a posteriori (MAP) classification areas below the diagonal of Figure 4.
4.2 Simulation studies
Here we discuss the results from some simulation studies we conducted to compare the proposed GMMDR approach to modelbased clustering performed on both the original variables and the leading principal components. The data sets used for such comparison are generated from some overlapping Gaussian mixtures with different covariance matrix structures. In addition, we investigated the effect of the inclusion of a certain number of redundant and noise variables. The former are variables correlated with those bringing clustering information, so they only provide redundant information. The latter are variables whose distribution do not depend on the cluster label, thus they only introduce some noise which tends to obscure the underlying clustering structure to be recovered.
The simulation schemes used for generating the data are described below.
Model 1: three overlapping clusters with common covariances.
In the first simulation scheme each data set was generated from three overlapping clusters with equal mixing probabilities on three variables. These were generated from a multivariate normal distribution with means , , , and common covariance matrix
To this basic setup, which correponds to a EEE mixture model (see Fraley and Raftery, 2006, Table 1), we added two further scenarios. In the first scenario (noise variables), seven noise variables were generated from independent standard normal variables. In the second scenario (redundant and noise variables), we generated three variables correlated with each clustering variable (with correlation coefficients equal to 0.9, 0.7, 0.5, respectively) and four independent standard normal variables. In both cases, there was a total of ten variables.
Model 2: three overlapping clusters with constant shapes.
In this simulation scheme we generated observations with equal prior from three classes using the mixture model VEV. The means for each class were , , and , while for the covariance matrices () the scale, shape and orientation parameters were, respectively, , , and
As for the previous model setup, we also considered two further scenarios with the addition of noise variables in the first case, and noise and redundant variables in the second case.
Model 3: three overlapping clusters with unconstrained covariances.
For this last scheme we simulated each data set from three overlapping clusters with equal mixing probabilities on three variables. Clustering variables were generated from a multivariate normal distribution with parameters set as in the synthetic example of Section 4.1, which correspond to a VVV parametrization with . Again, we also considered the effect of adding only noise variables, and noise and redundant variables.
For each data set simulated as described above we fitted a GMM with number of clusters in the range and different model parametrizations, then we selected the model with the highest BIC. A GMM was also fitted to the leading principal components computed from the correlation matrix, or equivalently using standardized variables; the number of components was chosen using the Kaiser’s rule, i.e. selecting those with eigenvalues larger than one (Jolliffe, 2002, p. 114–115). Finally, we applied the GMMDR approach discussed in Section 2.2, followed by the subset selection procedure of Section 3. For each estimated model we computed the adjusted Rand index (Hubert and Arabie, 1985) as a criterion to evaluate the clustering partitions obtained: higher values of ARI correspond to better performance.
The results of the simulation procedure based on 1000 repetitions for increasing sample sizes are shown on Table 2
, where we reported the average adjusted Rand index for the three fitting procedures to be compared and the corresponding standard errors.
When no noise variables are included, the GMMDR procedure yields equivalent clustering accuracy to GMM on the original variables. However, when irrelevant features (either noise or redundant ones) are present the GMMDR procedure significantly improves the accuracy as measured by the adjusted Rand index. This is particular evident for smaller sample sizes () when noise variables are present, and when the underlying clusters have different within cluster covariance matrices. The PCA+GMM procedure leads to uniformly small values of ARI, confirming that reducing the dimensionality of the problem through principal components does not help to discover clustering structures in the data. As expected, as the sample size increases all the three methods improve in accuracy, with the GMM method on the original variables and the GMMDR approach leading essentially to the same accuracy for .
A further aspect was also checked, the number of clusters selected by each procedure (results not shown here). Recalling that for all the simulation schemes the number of clusters was three, GMMDR almost always selects the true value, whereas GMM tends to select a larger number of components when noise variables are included and the clusters have different covariance structures. PCA+GMM performs worst than the other two methods, often underestimating the true number of clusters in the presence of noise variables and for smaller sample sizes ().
No noise  Noise variables  Noise and redundant variables  
Model 1 (EEE)  
GMM  0.9684  0.9744  0.9754  0.6869  0.8918  0.9742  0.8518  0.9695  0.9743 
(0.0420)  (0.0166)  (0.0083)  (0.1363)  (0.1217)  (0.0090)  (0.1889)  (0.0241)  (0.0087)  
PCA+GMM  0.8591  0.9105  0.9146  0.4989  0.8352  0.9091  0.5361  0.7518  0.7942 
(0.1626)  (0.0446)  (0.0221)  (0.1700)  (0.1256)  (0.0298)  (0.1342)  (0.0909)  (0.0239)  
GMMDR  0.9716  0.9742  0.9753  0.8612  0.9674  0.9742  0.8832  0.9699  0.9742 
(0.0347)  (0.0172)  (0.0088)  (0.1426)  (0.0234)  (0.0090)  (0.1613)  (0.0197)  (0.0088)  
Model 2 (VEV)  
GMM  0.9738  0.9802  0.9819  0.7316  0.6328  0.9808  0.6648  0.6811  0.9807 
(0.0302)  (0.0141)  (0.0073)  (0.0825)  (0.0793)  (0.0081)  (0.0913)  (0.1572)  (0.0078)  
PCA+GMM  0.6562  0.6789  0.7130  0.1907  0.4307  0.6458  0.4579  0.6465  0.8018 
(0.2745)  (0.2138)  (0.0824)  (0.1386)  (0.1700)  (0.1417)  (0.1780)  (0.1483)  (0.0309)  
GMMDR  0.9709  0.9802  0.9819  0.9201  0.9747  0.9806  0.9231  0.9727  0.9806 
(0.0351)  (0.0141)  (0.0073)  (0.0799)  (0.0165)  (0.0082)  (0.0811)  (0.0169)  (0.0077)  
Model 3 (VVV)  
GMM  0.9960  0.9981  0.9983  0.6937  0.8877  0.9751  0.8434  0.9708  0.9742 
(0.0115)  (0.0043)  (0.0024)  (0.1260)  (0.1189)  (0.0081)  (0.1932)  (0.0194)  (0.0092)  
PCA+GMM  0.9805  0.9896  0.9899  0.4917  0.8357  0.9074  0.5396  0.7493  0.7945 
(0.0534)  (0.0105)  (0.0053)  (0.1663)  (0.1277)  (0.0316)  (0.1353)  (0.0984)  (0.0236)  
GMMDR  0.9952  0.9981  0.9983  0.8643  0.9674  0.9751  0.8799  0.9712  0.9740 
(0.0146)  (0.0042)  (0.0024)  (0.1384)  (0.0210)  (0.0081)  (0.1609)  (0.0185)  (0.0099) 
Based on the simulation results we may conclude that when there are redundant or noise variables a suitable subset of GMMDR variables allows us to significantly improve cluster identification. The improvement is larger for small samples having different withincluster covariances and in the presence of noise variables. Overall, using GMMDR variables do not produce worse results than using original variables, but often leads to a better selection of the clustering structure. There appears no reason to perform modelbased clustering on principal components.
4.2.1 Unequal prior probabilities
The previous simulation examples assumed equal prior group probabilities. It may be of interest to investigate if the results obtained depend on such assumption. We expect that, provided there is enough cluster information in the data, no dramatic change from previous findings should be observed. Thus, we replicated the previous simulation study for the more complex Model 3 (VVV), but with unequal prior probabilities set at
. The results are shown in Table 3.When no noise variables are included the accuracy of GMM and GMMDR are essentially the same as in the equal prior probabilities case, whereas PCA+GMM performs slightly worse. If irrelevant variables (either noise or redundant ones) are present both GMM and GMMDR yield somewhat worse results when or . This seems to be due to the fact that BIC often selected models with more components than the true number of clusters. On the contrary, PCA+GMM seems to improve when redundant variables (i.e. variables correlated with clustering ones) are included. Since principal components are computed from the correlation matrix, we argued that correlated variables are likely to provide useful clustering information when some groups are represented by few observations. Overall, we note that GMMDR never performed worst than GMM, and it was also better than PCA+GMM except when irrelevant variables are included and .
Model 3 (VVV)  No noise  Noise variables  Noise and redundant variables  

GMM  0.9826  0.9991  0.9994  0.7451  0.6195  0.7407  0.8214  0.7326  0.7209 
(0.0502)  (0.0025)  (0.0010)  (0.1468)  (0.1663)  (0.1160)  (0.1433)  (0.1188)  (0.0937)  
PCA+GMM  0.8781  0.8644  0.8606  0.8305  0.8347  0.7398  0.8334  0.8434  0.8228 
(0.0970)  (0.0871)  (0.0532)  (0.0748)  (0.1050)  (0.2017)  (0.0590)  (0.0353)  (0.1054)  
GMMDR  0.9742  0.9991  0.9994  0.8733  0.8176  0.7964  0.8720  0.8073  0.8136 
(0.0734)  (0.0024)  (0.0011)  (0.1341)  (0.1378)  (0.1547)  (0.1303)  (0.1601)  (0.1600) 
4.2.2 Highdimensional data
Clustering highdimensional data is a challenging task. The main problem in fitting a GMM model is the large number of parameters which need to be estimated. To overcome this Ghosh and Chinnaiyan (2002) suggested to perform modelbased clustering on the first few principal components. Based on the conclusion of Chang (1983), discussed also in Section 2, this approach does not seem to be advisable. Bouveyron et al (2007) proposed a method for dealing with high dimensional data.
In Section 4.2 we described a three clusters VEV model (2) on 3 clustering variables. We also considered the inclusion of 3 redundant variables (i.e., correlated with the clustering ones) and 4 noise variables; there was a total of ten variables according to the scheme . This basic setup has been generalized to investigated the behavior of the proposed approach as the number of variables increase. For we generated variables following the scheme ; for instance, if , there are 30 variables, 9 of them are clustering variables, 9 are correlated with the previous ones, and the remaining 12 are noise variables. We only investigated the most difficult case, i.e. the case of a small sample size, by setting .
As the number of variables increase the accuracy of all the three methods tend to decrease, with GMMDR always more accurate on the average that the other two methods (see Table 4). However, for and the average ARI for GMMDR dropped just above . Looking in more detail the results, we realized that, even though a subset of directions could provide better cluster accuracy, BIC frequently selected too many of such directions. To find evidence of this fact we replicated the simulation study, but using the simple model EII (common spherical covariance matrix) and selecting the GMMDR directions on the basis of the entropy criterion (Celeux and Soromen, 1996). This provides a measure of the overlap of the clusters, and it is defined as , where denotes the conditional probability that the th unit arises from the th mixture component. Except for the case , where the ARI is smaller, and , where the ARI is essentially equivalent, the accuracy of GMMDR is largely increased (see Table 4). Note also that, although we constrained to fit the wrong EII model, the accuracy of GMM do not worsen.
Based on this limited study, it seems that for highdimensional data GMMDR directions are still able to recover the clustering structure of the data, but the BIC criterion overestimates how many of them are needed. When the entropy criterion is used to select relevant directions the clustering accuracy improve.
Model 2 (VEV)  

GMM  0.6788  0.6405  0.6128  0.5968  0.5859 
(0.0850)  (0.0702)  (0.0744)  (0.0726)  (0.0783)  
PCA+GMM  0.4602  0.7238  0.6928  0.6485  0.6009 
(0.1831)  (0.1655)  (0.1574)  (0.1582)  (0.1283)  
GMMDR  0.9205  0.9717  0.8447  0.6802  0.6131 
(0.0843)  (0.0951)  (0.1960)  (0.2092)  (0.1920)  
GMM (EII)  0.7101  0.6581  0.6299  0.6148  0.5979 
(0.0889)  (0.0707)  (0.0696)  (0.0753)  (0.0765)  
GMMDR  0.8447  0.9830  0.9817  0.9204  0.8381 
(0.1672)  (0.0837)  (0.0801)  (0.1528)  (0.1923) 
An extreme form of highdimensional data is the case. Clustering of gene expression data arising from microarray experiments is a typical application and modelbased approaches have been proposed (Yeung et al, 2001). In order to apply the approach discussed in this paper, a form of regularization is required for the inversion of the covariance matrix in (2), such as one of those discussed in BernardMichel et al (2009). However, these aspects require further studies.
5 Data analysis examples
5.1 Italian wines
Forina et al (1986)
reported data on 178 wines grown in the same region in Italy but derived from three different cultivars (Barolo, Grignolino, Barbera). For each wine 13 measurements of chemical and physical properties were made, such as the level of alcohol, the level of magnesium, the color intensity, etc. The data set is avalaible at UCI Machine learning data repository
http://archive.ics.uci.edu/ml/datasets/Wine. The following analyses were performed on a standardized scale.Modelling all the available variables with a GMM selects the model VEI with eight clusters, which gives a small adjusted Rand index (see Table 5). This can be largely improved by the variable selection method proposed by Raftery and Dean (2006). A comparable accuracy is also achieved by McNicholas and Murphy (2008), who applied a class of parsimonious Gaussian mixture models (PGMM) based on mixtures of factor analyzers; their selected model has latent factors and it is denoted as CUU (see McNicholas and Murphy, 2008, Table 1). We note, however, that, despite the improvement on accuracy, PGMM selects a wrong number of clusters (see Table 5).
Method  Number of  Model  ARI  

features  
GMM  13  VEI  8  0.48 
GMM (best subset)  5  VEV  3  0.78 
PGMM  2  CUU  4  0.79 
GMMDR  5  VEV  3  0.85 
(Malic, Proline, Flavanoids, Intensity, OD280)
We applied the GMMDR approach to this data set and we obtained a final VEV model with three clusters estimated on five GMMDR directions. This model yields an adjusted Rand index equal to 0.85 (see Table 5
) and the resulting confusion matrix is reported in Table
6.Cluster  

Wines  1  2  3 
Barolo  57  2  0 
Grignolino  2  64  5 
Barbera  0  0  48 
As Figure 5 shows, the first two directions mainly exhibit differences in cluster means, whereas the remaining directions mostly represent differences in cluster variances. We also note that the first three GMMDR directions appear to contain most of the clustering information.
Figure 6 shows a static view of a rotating 3D spin plot for the first three GMMDR directions with data points marked according to cluster membership; Barbera wines (3) appear to be well separated from the others along the first direction, while Barolo (1) and Grignolino (2) wines present a significant separation on the plane identified by the second and third directions.
5.2 Australian crabs data
This data set contains measurements recorded on 200 specimens of Leptograpsus variegatus crabs on the shore in Western Australia (Campbell and Mahon, 1974)
. Crabs are classified according to their colour form (blue and orange) and sex. There are 50 specimens of each sex of each species. Each specimen has 5 measurements: frontal lobe size (FL), rear width (RW), carapace length (CL) and width (CW), body depth (BD). The main interest is to distinguish between species and sex based on the available morphological measurements.
Gaussian mixture modelling performs poorly for this dataset. The model with the highest BIC selects 9 components, which gives a small adjusted Rand index. If we constrain the number of components to match the actual number of classes, the performance is still poor (see Table 7). Raftery and Dean (2006) obtained a large improvement through variable selection, ending up with a mixture model based on four variables. This dataset was also analyzed by Bouveyron et al (2007, pp. 516–517) who obtained a 5% error rate with the highdimensional data clustering (HDDC) model .
Starting with the “best” unconstrained mixture model (EEE, ) we estimated the GMMDR directions and then we performed feature selection. The final GMMDR model is fitted on three directions with 4 clusters having ellipsoidal, equal volume and shape but different orientation (EEV). The error rate is 7.50%, the same obtained by Raftery and Dean (2006). The resulting accuracy is slightly worst than that for the HDDC model of Bouveyron et al (2007), but it has the advantage of providing a visualization of the clustering structure.
Method  Model  Error  ARI  
rate (%)  
GMM (unconstrained)  EEE  9  0.4831  
GMM (constrained)  VEV  4  42.50  0.3165 
GMM (best subset)  EEV  4  7.50  0.8154 
HDDC  4  5.00  
GMMDR  EEV  4  7.50  0.8195 
(CW, RW, FL, BD) 
Summary plots (not shown here) indicate that the first direction mainly separates blue from orange type of crabs, while the second reflects sex differences. The last direction adds only marginal information contrasting blue and orange crabs within sex. Figure 7 plots the uncertainty projected onto the first two GMMDR directions, where is the estimated conditional probability that observation belongs to group . Misclassified crabs are indicated by a square surrounding the point. As it can be seen, orange and blue crabs appear well separated, whereas discrimination of sex within crabs species is more problematic: some points lie in the regions of higher uncertainty (darker areas), in particular for blue crabs. It is interesting to note that these aspects can also be readoff from the corresponding confusion matrix (see Table 8).
Cluster  

Species  1  2  3  4 
BF  50  0  0  0 
BM  12  38  0  0 
OF  0  0  47  3 
OM  0  0  0  50 
6 Conclusion
In this paper we addressed the problem of dimension reduction for modelbased clustering from the point of view of visualizing the clustering structure. Often, principal component analysis is used as a preprocessing step, and a clustering procedure is applied to the leading principal components. Probabilistic principal components and factor analyzers have been also used for dimension reduction. However, these do not necessarily capture the underlying clustering structure.
Unlike other approaches (Tipping and Bishop, 1999a, b; McLachlan et al, 2003; Bouveyron et al, 2007; McNicholas and Murphy, 2008), the proposed method is not connected to any underlying latent variables structure, nor an EMlike algorithm is employed in the estimation. Instead, we aim at identifying the subspace which capture most of the clustering information contained in the data. Estimation of directions spanning such subspace is based on the eigendecomposition of a suitable kernel matrix, with unknown parameters estimated from a GMM selected as the best description of the data at hand. Visualization techniques can then be applied using the leading directions in order to obtain summary plots. The estimated directions can be further reduced by recasting the problem of feature selection as a model choice problem and using BIC to approximate Bayes factors. A forward greedy search algorithm is discussed to avoid the search over all possible subsets.
The proposed methodology has been applied to simulated and real data sets. In all cases we were able to identify a dimension reduced subspace while preserving the cluster information contained in the original variables. Moreover, simulation studies showed that a suitable subset of directions allows for significant improvement in cluster identification. The improvement is larger when withincluster covariances are different and redundant or noise variables are present.
Finally, it is straightforward to extend the proposed approach in a supervised context, where the class membership for each unit is known in advance.
Appendix
a.1 Proofs
Proof of proposition 1
Define the transformed data matrix as , where is a nonsingular matrix, is a vector of real values, and is a vector of ones. It is easy to show that in the transformed scale and the kernel matrix is given by . The generalized eigendecompostion in the transformed scale satisfy the equation , so must be equal to the matrix of eigenvalues of the generalized eigendecomposition of with respect to , and is equal to the corresponding diagonal matrix of eigenvalues. Hence, , , and the GMMDR directions in the transformed scale are given by . Furthermore, the corresponding variates , so they are simply a shifted version of the GMMDR variates in the original scale.
Proof of proposition 2
For any kernel matrix we may rewrite the eigendecomposition in (3) as . Since by definition , the diagonal matrix of eigenvalues can be expressed as .
We start by considering the kernel matrix , for which . Thus,
where is the projection onto the subspace spanned by the columns of . So, for a SIR kernel matrix each eigenvalue is equal to the variance of the between group means along the th direction. Consider the kernel matrix in (2), for which the diagonal matrix of eigenvalues can be written as
We note that since we have . Thus, the first part of the above equation may be written as
For the following part we have
Therefore, the matrix of eigenvalues can expressed as , and proposition 2 follows.
a.2 Greedy search algorithm for feature selection
In this section we provide a detailed description of the greedy search algorithm used to select the “best” subset of GMMDR variates.
The BIC criterion (Schwartz, 1978) is the basic building block for choosing the number of clusters and model parametrization (Fraley and Raftery, 1998). In general, for a finite mixture model with components it is computed as
where is the maximized loglikelihood, is the number of parameters estimated, and is the number of observations. The difference between BIC values for two models is approximately equal to twice the logarithm of the Bayes factor for the comparison of the two models (Kass and Raftery, 1995). Raftery and Dean (2006) investigated variable selection in modelbased clustering, recasting the problem as a model selection procedure. We adapt their approach for GMMDR variable selection.
Let be the set of GMMDR variables computed for a mixture model following the method described in Section 2.2. The criterion used for model comparison is the BIC difference in equation (4). The steps of the greedy search algorithm are the following:

Select the first variable to be the one which maximizes the BIC difference criterion. Let be the candidate set, and be the set of already included variables, which is of course empty at the beginning. We select the variable such that
At the first step , thus maximization is taken over the clustering models and with , where is the maximum number of clusters to be considered for the data. Then, define , set and go to the next step.

Select a variable to include, among those not already included, to be the one which maximizes the BIC difference criterion. Formally, let be the candidate set, and be the set of already included variables. We choose the variable to be included such that
In this case the “best” model is identified with respect to both the number of mixture components up to , and model parametrization. Then, update the subset of currently included variables, i.e. .

Set and iterate the previous step until a stopping rule is met. The algorithm naturally terminates when all variables are included, but it might be stopped earlier when, for example, the maximal BIC difference for the inclusion of a variable becomes negative.
As mentioned, the proposed greedy search is a forwardtype algorithm. Given the orthogonality of the GMMDR variables, a backward step is not required.
References
 Banfield and Raftery (1993) Banfield, J., Raftery, A.E.: Modelbased Gaussian and nonGaussian clustering. Biometrics 49, 803–821 (1993)
 BernardMichel et al (2009) BernardMichel, C., L., G., Girard, S.: Gaussian regularized sliced inverse regression. Stat. Comput. 19(1), 85–98 (2009)
 Biernacki et al (2000) Biernacki, C., Celeux, G., Govaert, G.: Assessing a mixture model for clustering with the integrated completed likelihood. IEEE T. Pattern. Anal. 22(7), 719–725 (2000)

Bishop and Tipping (1998)
Bishop, C.M., Tipping, M.E.: A hierarchical latent variable model for data visualization.
IEEE T. Pattern. Anal. 20(3), 281–293 (1998)  Bouveyron et al (2007) Bouveyron, C., Girard, S., Schmid, C.: Highdimensional data clustering. Comput. Stat. Data An. 52, 502–519 (2007)
 Campbell and Mahon (1974) Campbell, N., Mahon, R.: A multivariate study of variation in two species of rock crab of genus leptograpsus. Aust. J. Zool. 22, 417–425 (1974)
 Celeux and Govaert (1995) Celeux, G., Govaert, G.: Gaussian parsimonious clustering models. Pattern Recogn. 28, 781–793 (1995)
 Celeux and Soromen (1996) Celeux. G., Soromenho, G.: An entropy criterion for assessing the number of clusters in a mixture model. J. Classif. 2, 195–212 (1996)
 Chang (1983) Chang, W.: On using principal components before separating a mixture of two multivariate normal distributions. J. Roy. St. Soc. C Appl. Stat. 32(3) (1983)
 Cook (1998) Cook, R.D.: Regression Graphics: Ideas for Studying Regressions Through Graphics. Wiley, New York (1998)
 Cook and Yin (2001) Cook, R.D., Yin, X.: Dimension reduction and visualization in discriminant analysis (with discussion). Aust. Nz. J. Stat. 43, 147–199 (2001)
 Dempster et al (1977) Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the em algorithm (with discussion). J. Roy. Stat. Soc. B Met. 39, 1–38 (1977)
 Forina et al (1986) Forina, M., Armanino, C., Castino, M., Ubigli, M.: Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25, 189–201 (1986). URL ftp://ftp.ics.uci.edu/pub/machinelearningdatabases/wine. Wine Recognition Database

Fraley and Raftery (1998)
Fraley, C., Raftery, A.E.: How many clusters? Which clustering method? Answers via modelbased cluster analysis.
Comput. J. 41, 578–588 (1998)  Fraley and Raftery (2002) Fraley, C., Raftery, A.E.: Modelbased clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 97(458), 611–631 (2002)
 Fraley and Raftery (2006) Fraley, C., Raftery, A.E.: MCLUST version 3 for R: Normal mixture modeling and modelbased clustering. Tech. Rep. 504, Department of Statistics, University of Washington (2006)
 Ghosh and Chinnaiyan (2002) Ghosh, D., Chinnaiyan, A.M.: Mixture modelling of gene expression data from microarray experiments. Bioinformatics 18(2), 275–286 (2002)
 Hubert and Arabie (1985) Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2, 193–218 (1985)
 Jolliffe (2002) Jolliffe, I.T.: Principal Component Analysis. Springer, New York (2002)
 Kass and Raftery (1995) Kass, R.E., Raftery, A.E.: Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995)
 Li (1991) Li, K.C.: Sliced inverse regression for dimension reduction (with discussion). J. Am. Stat. Assoc. 86, 316–342 (1991)
 Li (2000) Li, K.C.: High dimensional data analysis via the SIR/PHD approach. Unpublished manuscript (2000)
 Maugis et al (2009) Maugis, C., Celeux, G., MartinMagniette, M.L.: Variable selection for Clustering with Gaussian Mixture Models. To appear in Biometrics (2009)
 McLachlan and Peel (2000) McLachlan, G., Peel, D.: Finite Mixture Models. Wiley, New York (2000)
 McLachlan et al (2003) McLachlan, G., Peel, D., Bean, R.: Modelling highdimensional data by mixtures of factor analyzers. Comput. Stat. Data An. 41, 379–388 (2003)
 McNicholas and Murphy (2008) McNicholas, P., Murphy, T.: Parsimonious Gaussian mixture models. Stat. Comput. 18, 285–296 (2008)
 Raftery and Dean (2006) Raftery, A.E., Dean, N.: Variable selection for modelbased clustering. J. Am. Stat. Assoc. 101(473), 168–178 (2006)
 Schwartz (1978) Schwartz, G.: Estimating the dimension of a model. Ann. Stat. 6, 31–38 (1978)
 Tipping and Bishop (1999a) Tipping, M.E., Bishop, C.M.: Mixtures of probabilistic principal component analyzers. Neural Comput. 11(2), 443–482 (1999)
 Tipping and Bishop (1999b) Tipping, M.E., Bishop, C.M.: Probabilistic principal component analysis. J. Roy. Stat. Soc. B Met. 61, 611–622 (1999)
 Yeung et al (2001) Yeung, K.Y., Fraley, C., Murua, A., Raftery, A.E., Ruzzo, W.L.: Modelbased clustering and data transformations for gene expression data. Bioinformatics 17(10), 977–987 (2001)
Comments
There are no comments yet.