In many real applications, observations are subject to missing entries during the data collection process. Handling these missing values is therefore a crucial point in model estimation. This aspect is also fundamental in pattern recognition, where missing values can be informative, and an inaccurate treatment can lead to serious errors in classification and density estimation. Before focusing on the methods available for dealing with missing data, it is necessary to specify the nature of the missing mechanism (Rubin:1976). This describes how the probability of a missing value is related to the data and, although in general it is unknown and unobservable, some assumptions about the underlying missing mechanism must be made. Often the validity of the methods used in practical applications depends on whether or not these assumptions are fulfilled. Three different missing mechanisms are common in the literature: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). For the technical details and definitions we refer to Little:Rubin:2002. We note, however, that under the MCAR and MAR assumptions the missing mechanism is considered to be ignorable, i.e. likelihood-based inference on the model parameters can proceed without modelling the missing-data mechanism.
Different approaches for dealing with missing values can be found in the literature. In the listwise/casewise deletion approach, observations with missing entries are removed, and models are estimated using only the fully observed data (Little:Rubin:2002, Chap. 3). This approach is simple and fast to implement, but the estimates may be biased owing to the loss of information caused by the removal of part of the data. Another popular approach is to impute the missing observations, that is, to fill in the missing data with a single value or a set of values. Single imputation methods include mean imputation (Wilks:1932), which fills the missing values with the unconditional or conditional mean; regression imputation (Buck:1960), which replaces missing values with predicted scores from a regression model, together with its stochastic version (Little:Rubin:2002, Sec. 4.3); and hot-deck imputation (Ford:1983), a collection of techniques that impute the missing values with scores from similar observations. Multiple imputation refers to a set of procedures that fill the missing values with different plausible values, thus generating different complete datasets. Each complete dataset is then analysed, and the results are combined to reflect the uncertainty due to the presence of missing values (rubin1987multiple; schafer1997analysis). Finally, model-based procedures assume a specific distribution for the observed data and draw inference on the parameters based on the likelihood. A popular method for maximum likelihood estimation in missing-data problems is the Expectation-Maximisation (EM) algorithm (dempster1977maximum).
Gaussian Mixture Models (GMMs) are a powerful tool for density estimation, clustering, and classification (fraley2002model; mclachlan2000finite). Parsimonious parametrisation of the covariance matrices allows for a flexible class of models, each with its distinctive characteristics (banfield1993model; celeux1995gaussian). However, the resulting log-likelihood is difficult to maximise directly, even numerically (see mclachlan2000finite, Sect. 2.8.1), so GMMs are usually fitted by reformulating the mixture problem as an incomplete-data problem within the EM framework. Since the missing observations may contain important information about the clustering structure of the data, ghahramani1995learning extended the EM algorithm for GMMs with missing values. They derived a closed-form solution for the M-step, but only in the unconstrained case, that is, for full within-cluster covariance matrices. No solutions are available for the parsimonious parametrisations of the covariance matrices mentioned above.
In this paper we extend the work of ghahramani1995learning by considering all the parsimonious covariance structures proposed in banfield1993model and celeux1995gaussian. To this goal, we develop two model-based approaches to handle missing data in GMMs, both based on a modified E-step computed on augmented data, followed by maximisation of the complete-data log-likelihood using the standard M-step for GMMs. Assuming that the data are multivariate Gaussian within each cluster, Monte Carlo sampling allows us to approximate the E-step using an augmented dataset in which the missing values are filled in. The complete-data log-likelihood of the augmented dataset can then be maximised using standard formulas for GMMs. The main advantage of this approach is that only the E-step is modified, whereas the M-step for each parsimonious parametrisation of the covariance matrices is unchanged.
The paper is organised as follows. Section 2 contains a brief description of Gaussian mixtures and the standard EM algorithm. Section 3 introduces two approaches for handling missing data in Gaussian mixture models, with details on the corresponding Monte Carlo EM algorithms presented in Section 4. In Section 5 the performance of the proposed algorithms is evaluated on both simulated and real datasets, in terms of both clustering and density estimation. The final section provides some concluding remarks.
2 Gaussian mixtures and the EM algorithm
2.1 Gaussian Mixture Models
Gaussian Mixture Models (GMMs) are a popular parametric model for cluster analysis and density estimation. This class of models has been widely used for different problems in several fields owing to its flexibility and interpretability. Extensive reviews of mixture models can be found in mclachlan2000finite, and in fraley2002model for the applications of GMMs to cluster analysis, discriminant analysis, and density estimation.
Let $\mathbf{x}_1, \ldots, \mathbf{x}_n$ be a random sample of $n$ observations with $\mathbf{x}_i \in \mathbb{R}^d$. The observed data are assumed to arise from a Gaussian mixture distribution if the density can be written as a convex combination of multivariate Gaussian distributions such that:
$$f(\mathbf{x}_i; \boldsymbol{\Psi}) = \sum_{k=1}^{K} \pi_k \, \phi(\mathbf{x}_i; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),$$
where $K$ is the number of mixture components, $\phi(\cdot\,; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ is the underlying density function of the $k$th component, a multivariate Gaussian distribution with mean $\boldsymbol{\mu}_k$ and covariance matrix $\boldsymbol{\Sigma}_k$, $(\pi_1, \ldots, \pi_K)$ are the mixing weights, such that $\pi_k > 0$ and $\sum_{k=1}^{K} \pi_k = 1$, and $\boldsymbol{\Psi} = \{\pi_1, \ldots, \pi_{K-1}, \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K, \boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_K\}$ is the set of unknown parameters of the mixture model.
The Gaussian assumption implies ellipsoidal clusters, each centred at the mean vector $\boldsymbol{\mu}_k$ and with covariance matrix $\boldsymbol{\Sigma}_k$. The latter controls the geometrical features of the ellipsoids, such as the orientation, the volume, and the shape. Parsimonious parameterisation of the covariance matrices for GMMs can be achieved through the eigen-decomposition $\boldsymbol{\Sigma}_k = \lambda_k \mathbf{D}_k \mathbf{A}_k \mathbf{D}_k^{\top}$, where $\lambda_k$ is a scalar controlling the volume, $\mathbf{A}_k$ is a diagonal matrix controlling the shape, and $\mathbf{D}_k$ is an orthogonal matrix controlling the orientation of the ellipsoids. This parameterisation generates a broad class of models with specific geometric properties (banfield1993model; celeux1995gaussian).
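As a concrete illustration of this decomposition, the following Python sketch (our own code with hypothetical names, not part of any package) composes a single component covariance matrix from its volume, shape, and orientation in two dimensions; constraining any of the three factors to be common across components yields the parsimonious models mentioned above.

```python
import numpy as np

# Illustrative sketch: build a covariance matrix from the eigen-decomposition
# Sigma_k = lambda_k * D_k * A_k * D_k', here in two dimensions, where the
# orthogonal matrix D_k is a rotation matrix.
def make_covariance(volume, shape, angle):
    """volume: the scalar lambda_k; shape: diagonal of A_k (with det(A_k) = 1);
    angle: rotation angle defining the orthogonal matrix D_k."""
    A = np.diag(shape)
    D = np.array([[np.cos(angle), -np.sin(angle)],
                  [np.sin(angle),  np.cos(angle)]])
    return volume * D @ A @ D.T

# An elongated ellipsoid rotated by 45 degrees:
Sigma = make_covariance(volume=2.0, shape=[4.0, 0.25], angle=np.pi / 4)
```

For instance, models with common volume share $\lambda$ across components, while spherical models fix $\mathbf{A}_k$ to the identity matrix.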
Maximum likelihood estimation of the unknown parameters of a mixture model can in principle be obtained by maximising the log-likelihood:
$$\ell(\boldsymbol{\Psi}) = \sum_{i=1}^{n} \log \left\{ \sum_{k=1}^{K} \pi_k \, \phi(\mathbf{x}_i; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}.$$
However, the objective function above is hard to maximise directly, even numerically (see mclachlan2000finite, Sect. 2.8.1). Consequently, the standard algorithm employed for parameter estimation in mixture models is the Expectation-Maximisation (EM) algorithm, which is briefly reviewed in the following subsection.
2.2 EM algorithm for mixture models
dempster1977maximum introduced an iterative procedure to estimate the parameters of a mixture model by maximising the log-likelihood through the alternation of two steps. The Expectation step (E-step) computes the expected value of the complete-data log-likelihood, and the Maximisation step (M-step) maximises the expected value previously computed with respect to the parameters of the mixture model. The algorithm guarantees a non-decreasing log-likelihood and, under fairly general conditions, it converges to at least a local maximum (mclachlan2008algorithm).
Suppose there exists an unobservable process $\mathbf{Z} = (\mathbf{z}_1, \ldots, \mathbf{z}_n)$, where each $\mathbf{z}_i = (z_{i1}, \ldots, z_{iK})$ is a latent variable specifying the component-label, i.e. $z_{ik} = 1$ if the $i$-th observation belongs to component $k$ and $z_{ik} = 0$ otherwise.
Let $f(\mathbf{x}_i, \mathbf{z}_i; \boldsymbol{\Psi})$ be the joint density of the complete-data vector, formed by the observed variable $\mathbf{x}_i$ and the latent variable $\mathbf{z}_i$. Then, the complete-data log-likelihood can be written as
$$\ell_c(\boldsymbol{\Psi}) = \sum_{i=1}^{n} \log f(\mathbf{x}_i, \mathbf{z}_i; \boldsymbol{\Psi}). \tag{1}$$
The E-step computes the expected value of the complete-data log-likelihood in (1) with respect to the latent variable $\mathbf{Z}$, using the current fit $\boldsymbol{\Psi}^{(t)}$ for $\boldsymbol{\Psi}$, the so-called Q-function:
$$Q(\boldsymbol{\Psi}; \boldsymbol{\Psi}^{(t)}) = \mathbb{E}_{\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\Psi}^{(t)}} \left[ \ell_c(\boldsymbol{\Psi}) \right]. \tag{2}$$
The M-step updates the estimate of $\boldsymbol{\Psi}$ by maximising the Q-function in (2), such that:
$$\boldsymbol{\Psi}^{(t+1)} = \operatorname*{arg\,max}_{\boldsymbol{\Psi}} \, Q(\boldsymbol{\Psi}; \boldsymbol{\Psi}^{(t)}).$$
In the GMMs case, the complete-data log-likelihood is
$$\ell_c(\boldsymbol{\Psi}) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \left[ \log \pi_k + \log \phi(\mathbf{x}_i; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right].$$
At iteration $t$ of the EM algorithm, the Q-function is
$$Q(\boldsymbol{\Psi}; \boldsymbol{\Psi}^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \hat{z}_{ik}^{(t)} \left[ \log \pi_k + \log \phi(\mathbf{x}_i; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right],$$
so the E-step computes the expected value of $z_{ik}$ as
$$\hat{z}_{ik}^{(t)} = \mathbb{E}\left[ z_{ik} \mid \mathbf{x}_i, \boldsymbol{\Psi}^{(t)} \right] = \frac{\pi_k^{(t)} \, \phi(\mathbf{x}_i; \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)})}{\sum_{j=1}^{K} \pi_j^{(t)} \, \phi(\mathbf{x}_i; \boldsymbol{\mu}_j^{(t)}, \boldsymbol{\Sigma}_j^{(t)})}.$$
The parameters of the mixture components are updated in the M-step by maximising the Q-function with respect to the parameters $(\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$, for $k = 1, \ldots, K$. The procedure is repeated until convergence, and the estimated parameters $\hat{\boldsymbol{\Psi}}$ are returned. celeux1995gaussian provide details of the M-step for the different covariance parameterisations, some of which have closed-form solutions, while others require numerical procedures.
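To make the two steps concrete, the following Python sketch (our own illustration, not the authors' implementation, which builds on R packages) performs one EM iteration for the unconstrained covariance model; all function names are ours.

```python
import numpy as np

# Minimal sketch of one EM iteration for an unconstrained GMM (model VVV).
def mvn_pdf(X, mean, cov):
    """Multivariate Gaussian density evaluated at the rows of X."""
    d = len(mean)
    diff = np.atleast_2d(X) - mean
    q = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return np.exp(-0.5 * q) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def em_iteration(X, weights, means, covs):
    n, K = X.shape[0], len(weights)
    # E-step: responsibilities z_hat[i, k] proportional to pi_k * phi(x_i; mu_k, Sigma_k)
    dens = np.column_stack([w * mvn_pdf(X, m, S)
                            for w, m, S in zip(weights, means, covs)])
    z_hat = dens / dens.sum(axis=1, keepdims=True)
    # M-step: closed-form updates of the weights, means and covariances
    nk = z_hat.sum(axis=0)
    weights = nk / n
    means = [z_hat[:, k] @ X / nk[k] for k in range(K)]
    covs = [(z_hat[:, k, None] * (X - means[k])).T @ (X - means[k]) / nk[k]
            for k in range(K)]
    return weights, means, covs, z_hat
```

For the constrained parameterisations, only the covariance update changes; the responsibilities are computed in the same way.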
3 Missing data in Gaussian mixture models
3.1 EM algorithm for GMMs with missing values
Using the previous notation, a single observation is partitioned as $\mathbf{x}_i = (\mathbf{x}_i^{O}, \mathbf{x}_i^{M})$, where $\mathbf{x}_i^{O}$ is the observed part, and $\mathbf{x}_i^{M}$ is the missing part of the data vector. Note that, because the missing pattern of each observation could be different, the subscript $i$ should be included in the superscripts $O$ and $M$, but for ease of presentation and reading we avoid that notation. If the vector is assumed to be Gaussian, that is $\mathbf{x}_i \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, with
$$\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}^{O} \\ \boldsymbol{\mu}^{M} \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}^{OO} & \boldsymbol{\Sigma}^{OM} \\ \boldsymbol{\Sigma}^{MO} & \boldsymbol{\Sigma}^{MM} \end{pmatrix},$$
then, by the conditional property of the Gaussian distribution, we may write the conditional distribution of the missing part given the observed part of the data as
$$\mathbf{x}_i^{M} \mid \mathbf{x}_i^{O} \sim N(\boldsymbol{\mu}^{M \mid O}, \boldsymbol{\Sigma}^{M \mid O}).$$
Thus, the conditional distribution of the missing part given the observed part follows a multivariate Gaussian distribution, with $\boldsymbol{\mu}^{M \mid O} = \boldsymbol{\mu}^{M} + \boldsymbol{\Sigma}^{MO} (\boldsymbol{\Sigma}^{OO})^{-1} (\mathbf{x}_i^{O} - \boldsymbol{\mu}^{O})$ and $\boldsymbol{\Sigma}^{M \mid O} = \boldsymbol{\Sigma}^{MM} - \boldsymbol{\Sigma}^{MO} (\boldsymbol{\Sigma}^{OO})^{-1} \boldsymbol{\Sigma}^{OM}$.
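These conditional moments amount to a few lines of linear algebra. The sketch below (our own hypothetical helper, with index sets passed explicitly) returns the mean and covariance of $\mathbf{x}^{M} \mid \mathbf{x}^{O}$:

```python
import numpy as np

# Sketch of the Gaussian conditioning step (our notation): given the observed
# coordinates obs_idx of x, return the mean and covariance of x^M | x^O.
def conditional_gaussian(mu, Sigma, x_obs, obs_idx, mis_idx):
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
    S_mo = Sigma[np.ix_(mis_idx, obs_idx)]
    S_mm = Sigma[np.ix_(mis_idx, mis_idx)]
    W = S_mo @ np.linalg.inv(S_oo)               # regression coefficients
    mu_cond = mu[mis_idx] + W @ (x_obs - mu[obs_idx])
    Sigma_cond = S_mm - W @ S_mo.T               # Schur complement
    return mu_cond, Sigma_cond
```

For example, with $\boldsymbol{\mu} = (0, 0)$, unit variances and correlation $0.5$, observing $x_1 = 2$ gives conditional mean $1$ and conditional variance $0.75$ for $x_2$.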
In the GMMs context, the underlying density function of each component is assumed to be a multivariate Gaussian distribution, that is $\mathbf{x}_i \mid (z_{ik} = 1) \sim N(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$. In the presence of missing values, we may write:
$$\begin{pmatrix} \mathbf{x}_i^{O} \\ \mathbf{x}_i^{M} \end{pmatrix} \,\Big|\, (z_{ik} = 1) \sim N(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),$$
with
$$\boldsymbol{\mu}_k = \begin{pmatrix} \boldsymbol{\mu}_k^{O} \\ \boldsymbol{\mu}_k^{M} \end{pmatrix}, \qquad \boldsymbol{\Sigma}_k = \begin{pmatrix} \boldsymbol{\Sigma}_k^{OO} & \boldsymbol{\Sigma}_k^{OM} \\ \boldsymbol{\Sigma}_k^{MO} & \boldsymbol{\Sigma}_k^{MM} \end{pmatrix}.$$
Furthermore, by the conditional property of the Gaussian distribution, the joint distribution of the observed part and the missing part given the group membership can be factorised as the product of two Gaussian distributions:
$$\phi(\mathbf{x}_i^{O}, \mathbf{x}_i^{M}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \phi(\mathbf{x}_i^{O}; \boldsymbol{\mu}_k^{O}, \boldsymbol{\Sigma}_k^{OO}) \, \phi(\mathbf{x}_i^{M}; \boldsymbol{\mu}_k^{M \mid O}, \boldsymbol{\Sigma}_k^{M \mid O}).$$
ghahramani1995learning proposed an extension of the standard EM algorithm to deal with missing values (see also hunt2003mixture). The complete-data log-likelihood in this case can be written as
$$\ell_c(\boldsymbol{\Psi}) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \left[ \log \pi_k + \log \phi(\mathbf{x}_i^{O}, \mathbf{x}_i^{M}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]. \tag{3}$$
Thus, the conditional expectation in the E-step takes the following form:
$$Q(\boldsymbol{\Psi}; \boldsymbol{\Psi}^{(t)}) = \mathbb{E}_{\mathbf{Z}, \mathbf{X}^{M} \mid \mathbf{X}^{O}, \boldsymbol{\Psi}^{(t)}} \left[ \ell_c(\boldsymbol{\Psi}) \right]. \tag{4}$$
In this case there are two sources of missingness, one related to the unknown classification, and one connected with the missing part of the data. Therefore, additional expectations must be introduced to solve the E-step. By solving these expectations, it is possible to obtain a closed-form expression for the M-step under the assumption of unconstrained within-component covariance matrices (ghahramani1995learning; hunt2003mixture; eirola2014mixture). However, the flexible parameterisations of the covariance matrices described in Section 2.1 have not been taken into account. For this reason, we propose methods to solve the $Q$-function in (4) for the general family of parsimonious GMMs.
3.2 Proposed methods
In this paper we propose two versions of a Missing Monte Carlo EM (MMCEM) approach. Both versions are based on the idea of using a Monte Carlo EM algorithm (MCEM; wei1990monte) to approximate the expected values required in the E-step.
Starting from equation (4), the first idea is to use MC approximations directly to solve the additional expectations generated by the presence of missing values. The expected complete-data log-likelihood can be approximated as follows:
$$Q(\boldsymbol{\Psi}; \boldsymbol{\Psi}^{(t)}) \approx \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{n} \sum_{k=1}^{K} \tilde{z}_{ik}^{(m)} \left[ \log \pi_k + \log \phi(\mathbf{x}_i^{O}, \tilde{\mathbf{x}}_i^{M(m)}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right], \tag{5}$$
where $M$ is the MC sample size, $\tilde{z}_{ik}^{(m)}$ is the simulated indicator variable, and $\tilde{\mathbf{x}}_i^{M(m)}$ is the imputed value. Comparing equation (3) and equation (5), it is clear that the latter is (up to the constant factor $1/M$) the complete-data log-likelihood of a GMM for the augmented dataset of dimension $nM \times d$.
Drawing $M$ indicator variables $\tilde{\mathbf{z}}_i^{(m)}$ for each observation from the conditional distribution of $\mathbf{z}_i$ given $\mathbf{x}_i^{O}$, the missing part is imputed $M$ times using the conditional distribution of $\mathbf{x}_i^{M}$, i.e. conditional on the simulated group membership and on the observed part $\mathbf{x}_i^{O}$, to obtain $\tilde{\mathbf{x}}_i^{M(m)}$. We refer to this method as MMCEM1.
The second approach employs the MC approximation in a different way, together with the law of total expectation. The expectation computed in the E-step of the MMCEM1 method described above is of the form
$$\mathbb{E}_{\mathbf{Z}, \mathbf{X}^{M} \mid \mathbf{X}^{O}, \boldsymbol{\Psi}^{(t)}} \left[ \ell_c(\boldsymbol{\Psi}) \right]. \tag{6}$$
Since equation (6) can be rewritten as
$$\mathbb{E}_{\mathbf{Z} \mid \mathbf{X}^{O}, \boldsymbol{\Psi}^{(t)}} \left[ \mathbb{E}_{\mathbf{X}^{M} \mid \mathbf{Z}, \mathbf{X}^{O}, \boldsymbol{\Psi}^{(t)}} \left[ \ell_c(\boldsymbol{\Psi}) \right] \right], \tag{7}$$
the inner expected value can be computed using MC approximations, and the outer expected value can be written in closed form, i.e.
$$\mathbb{E}\left[ z_{ik} \mid \mathbf{x}_i^{O}, \boldsymbol{\Psi}^{(t)} \right] = \frac{\pi_k^{(t)} \, \phi(\mathbf{x}_i^{O}; \boldsymbol{\mu}_k^{O(t)}, \boldsymbol{\Sigma}_k^{OO(t)})}{\sum_{j=1}^{K} \pi_j^{(t)} \, \phi(\mathbf{x}_i^{O}; \boldsymbol{\mu}_j^{O(t)}, \boldsymbol{\Sigma}_j^{OO(t)})} = \hat{z}_{ik}.$$
Using the approach in equation (7), the expected complete-data log-likelihood can be approximated as follows:
$$Q(\boldsymbol{\Psi}; \boldsymbol{\Psi}^{(t)}) \approx \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{n} \sum_{k=1}^{K} \hat{z}_{ik} \left[ \log \pi_k + \log \phi(\mathbf{x}_i^{O}, \tilde{\mathbf{x}}_{ik}^{M(m)}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right], \tag{8}$$
where $\hat{z}_{ik}$ is the expected group membership indicator variable for observation $i$, computed using the conditional distribution of $\mathbf{z}_i$ given $\mathbf{x}_i^{O}$, and $\tilde{\mathbf{x}}_{ik}^{M(m)}$ is the missing part of observation $i$, which is imputed $M$ times for each group $k$. Also in this case, equation (8) is the complete-data log-likelihood of the augmented dataset. We refer to this method as MMCEM2.
Note that, since the standard M-step is unchanged, both approaches introduced in this Section allow estimation of the parameters for all the parsimonious GMMs obtained by adopting the different parameterisations of the covariance matrices. The next section provides further details concerning the proposed algorithms.
4 MMCEM algorithms
The approximations discussed in the previous Section yield two different methods for computing the E-step; see equation (5) and equation (8). This section describes further computational details concerning the proposed extensions of the EM algorithm for GMMs with missing values.
4.1 MMCEM1 algorithm
Initialisation is an important step in any iterative algorithm, especially when the function to maximise is multimodal, as in the case of GMMs. Proportions, means, and covariance matrices of the mixture components must be provided to start the EM algorithm. They can be estimated from the fully observed dataset obtained after removing the observations with missing entries. Then, the classification matrix is initialised using the observed part of each observation:
$$\hat{z}_{ik} = \frac{\pi_k \, \phi(\mathbf{x}_i^{O}; \boldsymbol{\mu}_k^{O}, \boldsymbol{\Sigma}_k^{OO})}{\sum_{j=1}^{K} \pi_j \, \phi(\mathbf{x}_i^{O}; \boldsymbol{\mu}_j^{O}, \boldsymbol{\Sigma}_j^{OO})}.$$
The imputation of the missing values is performed directly in the E-step using the following steps:
At iteration $t$, draw $\tilde{\mathbf{z}}_i^{(m)}$, for $m = 1, \ldots, M$, from a Multinomial distribution with parameter $(\hat{z}_{i1}, \ldots, \hat{z}_{iK})$:
$$\tilde{\mathbf{z}}_i^{(m)} \sim \text{Multinomial}(1; \hat{z}_{i1}, \ldots, \hat{z}_{iK}),$$
where $\tilde{\mathbf{z}}_i^{(m)}$ is the $m$th draw from the conditional probability of observation $i$ belonging to cluster $k$, given the observed part of the data and the previous estimates of the parameters. For observations with missing values, only the observed part is considered in computing these probabilities.
After simulating the classification matrix, the missing values are imputed. Each missing value is simulated from the conditional distribution of $\mathbf{x}_i^{M}$ given $\mathbf{x}_i^{O}$ and the drawn component label $\tilde{\mathbf{z}}_i^{(m)}$, for $m = 1, \ldots, M$.
The new augmented dataset $\tilde{\mathbf{X}}$ is given by the union of the $M$ imputed datasets, and the new classification matrix $\tilde{\mathbf{Z}}$ is given by the union of the classification matrices drawn at step 1. For instance, a generic observation containing at least one missing value is represented in the augmented dataset by the $M$ rows
$$\left( \mathbf{x}_i^{O}, \tilde{\mathbf{x}}_i^{M(1)} \right), \; \ldots, \; \left( \mathbf{x}_i^{O}, \tilde{\mathbf{x}}_i^{M(M)} \right).$$
The new dataset $\tilde{\mathbf{X}}$ and classification matrix $\tilde{\mathbf{Z}}$ are then used in the standard M-step.
The parameters and the classification matrix, obtained according to the MAP principle for assigning the observations to a given cluster, are computed using the standard M-step for GMMs applied to the augmented data $\tilde{\mathbf{X}}$ and the classification matrix $\tilde{\mathbf{Z}}$.
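The E-step above can be sketched in code for a single incomplete observation. This is our own illustrative Python (the paper's implementation builds on the R package mclust), with NaNs marking the missing entries and all names hypothetical:

```python
import numpy as np

# Illustrative sketch (our notation) of the MMCEM1 augmentation for one
# observation x: draw M component labels from the observed-data posterior,
# then impute the missing coordinates from the conditional Gaussian.
def mvn_pdf(x, mean, cov):
    diff = np.atleast_2d(x) - mean
    q = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return (np.exp(-0.5 * q) /
            np.sqrt((2 * np.pi) ** len(mean) * np.linalg.det(cov)))[0]

def augment_mmcem1(x, weights, means, covs, M, rng):
    o = np.where(~np.isnan(x))[0]          # observed coordinates
    m = np.where(np.isnan(x))[0]           # missing coordinates
    # step 1: cluster probabilities based on the observed part of x only
    p = np.array([w * mvn_pdf(x[o], mu[o], S[np.ix_(o, o)])
                  for w, mu, S in zip(weights, means, covs)])
    p /= p.sum()
    rows, labels = [], []
    for _ in range(M):
        k = rng.choice(len(weights), p=p)  # draw the component label
        # step 2: impute x^M from its conditional Gaussian given x^O and k
        W = covs[k][np.ix_(m, o)] @ np.linalg.inv(covs[k][np.ix_(o, o)])
        mu_c = means[k][m] + W @ (x[o] - means[k][o])
        S_c = covs[k][np.ix_(m, m)] - W @ covs[k][np.ix_(o, m)]
        row = x.copy()
        row[m] = rng.multivariate_normal(mu_c, S_c)
        rows.append(row)
        labels.append(k)
    return np.array(rows), np.array(labels)  # M augmented copies of x
```

Stacking these blocks over all observations yields the augmented dataset passed to the standard M-step.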
4.1.4 Iterations and convergence check
Since standard convergence criteria for the EM algorithm cannot be used in this case because of the MC error, iterations and convergence checking are performed as described below:
EM steps are run for an initial number of iterations, and only the parameters associated with the largest value of the log-likelihood are stored.
In the following EM steps, the parameters are updated when the log-likelihood improves compared to the previous step.
The algorithm is stopped if the log-likelihood does not increase for a given number of successive iterations.
Two tuning parameters must be set in the above algorithm. If the initial number of “warm-up” iterations is large, the algorithm potentially explores a large part of the parameter space at the cost of increased computing time, and vice versa when it is small. The second parameter specifies the number of “stalled iterations” allowed, so we can control how stringent the adopted criterion for declaring convergence is. A sensible choice of these tuning parameters is needed to balance the efficiency and accuracy of the MCEM.
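The iteration scheme can be sketched as follows; this is a minimal illustration of the stopping rule, where the parameter names `warmup` and `patience` are ours, and `em_step` stands for one full E/M cycle returning the updated parameters and the current log-likelihood:

```python
# Sketch of the MC-EM stopping rule: keep the best fit seen so far, and
# after the warm-up phase stop once the log-likelihood has failed to
# improve for `patience` successive iterations (names are ours).
def run_mmcem(em_step, params, warmup, patience):
    best_ll, best_params = float("-inf"), params
    stalled, it = 0, 0
    while True:
        params, ll = em_step(params)
        it += 1
        if ll > best_ll:                   # store only improving fits
            best_ll, best_params = ll, params
            stalled = 0
        elif it > warmup:
            stalled += 1                   # the log-likelihood did not improve
            if stalled >= patience:
                break
        # during the warm-up phase the algorithm always keeps iterating
    return best_params, best_ll
```

Large `warmup` and `patience` values make the criterion more conservative at the cost of extra MC iterations, mirroring the trade-off discussed above.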
4.2 MMCEM2 algorithm
4.2.1 Initialisation, convergence and M-step
As in the MMCEM1 algorithm, the imputation is performed during the E-step. The algorithm used to build the augmented dataset is the following:
For each observation, simulate $M$ values $\tilde{\mathbf{x}}_{ik}^{M(m)}$, $m = 1, \ldots, M$, from the conditional distribution of $\mathbf{x}_i^{M}$ given $\mathbf{x}_i^{O}$ and $z_{ik} = 1$, for each component $k = 1, \ldots, K$.
Construct an augmented dataset in which each observation with missing entries is represented by the $K \times M$ rows $\left( \mathbf{x}_i^{O}, \tilde{\mathbf{x}}_{ik}^{M(m)} \right)$, each weighted by the corresponding expected membership $\hat{z}_{ik}$.
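In code, the MMCEM2 augmentation for one incomplete observation might look as follows; again this is our own illustrative Python sketch in our own notation, not the authors' implementation:

```python
import numpy as np

# Sketch of the MMCEM2 augmentation: the missing part is imputed M times
# under *each* component, and every copy carries the expected membership
# z_hat[k] as its weight (the closed-form outer expectation).
def mvn_pdf(x, mean, cov):
    diff = np.atleast_2d(x) - mean
    q = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return (np.exp(-0.5 * q) /
            np.sqrt((2 * np.pi) ** len(mean) * np.linalg.det(cov)))[0]

def augment_mmcem2(x, weights, means, covs, M, rng):
    o = np.where(~np.isnan(x))[0]
    m = np.where(np.isnan(x))[0]
    z_hat = np.array([w * mvn_pdf(x[o], mu[o], S[np.ix_(o, o)])
                      for w, mu, S in zip(weights, means, covs)])
    z_hat /= z_hat.sum()                   # expected memberships from x^O
    rows, wts = [], []
    for k in range(len(weights)):
        W = covs[k][np.ix_(m, o)] @ np.linalg.inv(covs[k][np.ix_(o, o)])
        mu_c = means[k][m] + W @ (x[o] - means[k][o])
        S_c = covs[k][np.ix_(m, m)] - W @ covs[k][np.ix_(o, m)]
        for _ in range(M):                 # inner MC approximation
            row = x.copy()
            row[m] = rng.multivariate_normal(mu_c, S_c)
            rows.append(row)
            wts.append(z_hat[k])
    return np.array(rows), np.array(wts)
```

Unlike MMCEM1, no component labels are drawn: the uncertainty about the classification is carried by the weights rather than simulated away.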
5 Empirical results

In this Section, the proposed algorithms are evaluated on both simulated and real datasets to assess their performance in terms of clustering and density estimation.
The software package mclust, freely available on CRAN (https://cran.r-project.org/web/packages/mclust/index.html) for the R language (RStat), provides fitting of finite mixtures of Gaussian distributions through the EM algorithm (mclust2). In particular, the function mstep() can be used to perform the maximisation step for each of the fourteen models in the parsimonious GMM family generated by the eigen-decomposition of the covariance matrices discussed in Section 2.1. This function, together with our code implementing the two versions of the E-step, has been used to build the algorithms described in Section 4.
The methods included in the comparison are:
EM algorithm with MC approximations of the E-step as presented in Section 4.1 (MMCEM1).
EM algorithm with MC approximations of the E-step as presented in Section 4.2 (MMCEM2).
Multiple imputation (schafer1997analysis), where different imputed datasets are generated, and for each of these the standard EM algorithm is applied to estimate the density and the final clustering (GMMMI). The imputeData() function in the mclust package is used to impute the missing values, and the Mclust() function is used to estimate GMMs on the imputed dataset. A label switching strategy is also implemented, using a majority vote to assign the observations to the different clusters. The number of multiple imputations is set at .
Gaussian mixture on the original dataset before introducing the missing values (GMM). The Mclust() function from the mclust package is used to estimate the parameters of the Gaussian mixture. These estimates represent the benchmark for the other methods in the presence of missing values: the closer a method's estimates are to these values, the better the method recovers the missing information.
The MC sample size is one of the most important tuning parameters in our proposed algorithms. This parameter must balance the precision of the method against its computational cost. Large MC sample sizes imply a higher probability of convergence to the true value, and therefore greater precision. In contrast, small MC sample sizes improve the speed of the algorithm to the detriment of the accuracy of the estimates. In our experiments we set the MC sample size to . This relatively small value provides a conservative assessment of the precision and efficiency of the MMCEM1 and MMCEM2 algorithms. Moreover, the “warm-up” parameter is set at , and the “stalled iterations” parameter is set at . Since the largest improvements of the log-likelihood are likely to happen in the initial iterations of the EM algorithm, we tried to replicate conditions similar to the standard EM algorithm. Clearly, larger MC sample sizes and larger values of the tuning parameters could only improve model fitting, at the cost of increased computing effort.
5.1 Synthetic data
Simulated datasets are generated from eight different scenarios using a mixture of Gaussian distributions with number of variables and number of mixture components . Details for each scenario are the following:
Three well-separated clusters from a Gaussian mixture with mean vector for each component given by , , , and common covariance matrix:
This corresponds to model EEE in the mclust nomenclature. The mixing probabilities are all equal to .
Three well-separated clusters with strongly unbalanced mixture of Gaussians with prior probabilities set to, and with the remaining parameters set as in the previous scenario.
Three-groups case with two overlapping clusters having mean vector for each component given by , , and , and common covariance matrix:
This corresponds to model EEE in the mclust nomenclature. The mixing probabilities are all equal to .
Three-groups case with two overlapping clusters with strongly unbalanced clusters with prior probabilities set to , and with the remaining parameters set as in the previous scenario.
Three well-separated clusters with mean for each component given by , , , and unconstrained covariance matrices:
This corresponds to model VVV in the mclust nomenclature. The mixing probabilities are all equal to .
Three well-separated clusters with strongly unbalanced mixture of Gaussians with prior probabilities set to , and with the remaining parameters set as in the previous scenario.
Three-groups case with two overlapping clusters having mean vector for each component given by , , , and unconstrained covariance matrices:
This corresponds to model VVV in the mclust nomenclature. The mixing probabilities are all equal to .
Three-groups case with two overlapping clusters with strongly unbalanced clusters and prior probabilities set to , and with the remaining parameters set as in the previous scenario.
Scenarios (a)-(b)-(e)-(f) are relatively simple cases compared to scenarios (c)-(d)-(g)-(h). In the latter cases, because the clusters overlap substantially, detecting the correct number of components can be particularly difficult. The unbalanced cases in scenarios (b)-(d)-(f)-(h) are used to assess the effect of missing values when some clusters have small sizes.
A missing mechanism is applied to each dataset with the missing percentage set to about , i.e. approximately half of the observations have at least one missing value. The details differ depending on whether the MCAR or the MAR mechanism is used.
For the MCAR mechanism, incomplete observations are randomly selected, and for each such observation the variable containing the missing value is also selected at random. These two steps guarantee that the mechanism is MCAR, because the missing values are a random sample of all data values.
To generate data under the MAR mechanism, the following process is adopted. Let $\mathbf{R}$ be a matrix of indicator variables, such that $r_{ij} = 0$ if $x_{ij}$ is missing, and 1 otherwise. From the definition in schafer1997analysis, the missing values are supposed to be MAR if:
$$\Pr(\mathbf{R} \mid \mathbf{X}) = \Pr(\mathbf{R} \mid \mathbf{X}^{O}). \tag{9}$$
However, to replicate a MAR mechanism it is necessary to specify the probabilities in (9). Since we are generating two-dimensional datasets, without loss of generality we may assume that the first variable is completely observed and that the second variable contains all the missing values. The probability of a missing value on the second variable for the $i$th observation can be modelled using an inverse logit transformation:
$$\Pr(r_{i2} = 0 \mid x_{i1}) = \frac{\exp(\beta_0 + \beta_1 x_{i1})}{1 + \exp(\beta_0 + \beta_1 x_{i1})},$$
for all $i = 1, \ldots, n$. The value chosen for the intercept guarantees that on average about 50% of the observations have a missing value. These probabilities are then used to randomly select, through a Bernoulli distribution, the observations that have a missing value in the second variable.
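The MAR generation step above can be sketched as follows; the coefficient values in the usage are illustrative choices of ours, not the ones used in the paper:

```python
import numpy as np

# Sketch of the MAR mechanism described above: the probability that the
# second variable is missing depends only on the fully observed first
# variable, through an inverse-logit transformation.
def make_mar(X, beta0, beta1, rng):
    p_miss = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * X[:, 0])))
    X_mis = X.copy()
    drop = rng.random(len(X)) < p_miss   # one Bernoulli draw per observation
    X_mis[drop, 1] = np.nan
    return X_mis
```

Because `p_miss` never depends on the (possibly missing) second column, the resulting mechanism satisfies definition (9).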
The performance of the methods under investigation is evaluated in terms of both density estimation and clustering accuracy. To assess the accuracy of the density estimates the Kullback-Leibler divergence (KL; kullback1951information) is used. This is a dissimilarity measure between two probability density functions. Let $f$ and $g$ be two density functions; then the KL divergence is defined as follows:
$$\mathrm{KL}(f \,\|\, g) = \int f(\mathbf{x}) \log \frac{f(\mathbf{x})}{g(\mathbf{x})} \, d\mathbf{x}, \tag{10}$$
where $\mathrm{KL}(f \,\|\, g) = 0$ if the two densities are equal. In general, $\mathrm{KL}(f \,\|\, g) \geq 0$, and it gets larger as the diversity between the two densities increases. The density $f$ is taken to be the true density of the simulated data, whereas $g$ is the density estimated using one of the methods under comparison. Thus, for a good density estimation method, the KL divergence should be as close to zero as possible. Since GMMs do not have a closed-form expression for (10), an MC approximation must be used (hershey2007approximating).
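As an illustration of such an approximation, the following sketch (our own code; mixtures are represented as hypothetical `(weights, means, covariances)` triples) estimates (10) by sampling from $f$ and averaging the log-density ratio:

```python
import numpy as np

# Monte Carlo approximation of KL(f || g) between two Gaussian mixtures,
# for which no closed form exists: draw from f and average log f - log g.
def mvn_pdf(X, mean, cov):
    d = len(mean)
    diff = np.atleast_2d(X) - mean
    q = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return np.exp(-0.5 * q) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def gmm_pdf(X, weights, means, covs):
    return sum(w * mvn_pdf(X, m, S) for w, m, S in zip(weights, means, covs))

def gmm_sample(n, weights, means, covs, rng):
    ks = rng.choice(len(weights), size=n, p=weights)
    return np.array([rng.multivariate_normal(means[k], covs[k]) for k in ks])

def kl_mc(f, g, n, rng):
    X = gmm_sample(n, *f, rng)            # draw n points from f
    return np.mean(np.log(gmm_pdf(X, *f)) - np.log(gmm_pdf(X, *g)))
```

The estimate is unbiased for (10), with MC error shrinking at the usual $O(n^{-1/2})$ rate.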
To compare the estimated classification with the true clusters, the Adjusted Rand Index (ARI; hubert1985comparing) is used. This measures the agreement between two partitions, with expected value 0 in case of random partitions, and a value equal to 1 in case of a perfect agreement. Thus, the true partition is compared with the classification provided by the GMM methods under comparison in the presence of missing values.
Models are estimated first by fixing the number of components and the parameterisation of the component covariances used to generate the data, and then by selecting the number of mixture components by the Bayesian Information Criterion (BIC; schwarz1978estimating). A final configuration is considered in which both the number of components and the parsimonious decomposition of the component-covariance matrices are selected by BIC. The last two situations allow us to investigate the influence of missing values on GMM parameter estimation when the number of groups, the component-covariance structure, or both are not known a priori.
In general, the proposed methods clearly outperform the multiple imputation approach in all scenarios in terms of density estimation accuracy. In this respect, MMCEM1 and MMCEM2 are essentially equivalent and close to the estimates obtained using the complete dataset (GMM). The same also applies to cluster identification, with a few exceptions where the three methods are roughly comparable. Multiple imputation (GMMMI) appears to be no worse than MMCEM1 and MMCEM2 only when clusters are unbalanced and overlapping.
When the number of groups is selected by BIC, the proposed methods again perform better than the multiple imputation approach, both in terms of clustering accuracy and density estimation. In particular, when the number of clusters is not fixed in advance, the MMCEM methods outperform multiple imputation in terms of classification also in the overlapping case with unbalanced clusters. In addition, as expected, GMM selects three groups, and MMCEM1 and MMCEM2 select three components in the majority of cases. Conversely, the multiple imputation approach (GMMMI) tends to select more components than the other methods in the well-separated cases, and fewer components in the overlapping cases (see Figures 7–8).
If the covariance matrices are constrained to be equal across the components, the multiple imputation approach tends to select more clusters than the original number of groups. This may be due to the imputation step generating points that fill the gaps between the clusters; as a consequence, constraining the ellipsoids to be equal increases the number of mixture components required.
When both the number of components and the covariance model are unknown and selected by the BIC criterion, our MMCEM methods tend to outperform the multiple imputation, with values close to the estimates provided by the GMM on the original data. This suggests that our methods seem to be able to recover the original structure of the data.
Another element arises from the simulations. The MMCEM2 algorithm appears to perform better than the MMCEM1 algorithm, with smaller standard errors, indicating a more precise method. This behaviour can also be noted in the distribution of the number of estimated components in Figures 7–8: in most cases MMCEM2 selects the correct number of clusters, whereas MMCEM1 shows much more variability in selecting the number of components.
5.2 Diabetes data
provided data from a diabetes study conducted at the Stanford Clinical Research Center. Blood chemistry measurements were recorded on 145 non-obese adult subjects, namely the area under the plasma glucose curve, the area under the plasma insulin curve, and the steady state plasma glucose level. After further analysis, the patients were classified into three groups: a group of patients suffering from overt diabetes (Overt), a group affected by chemical diabetes (Chemical), and a last group of patients without diabetes (Normal). The dataset is available in the R package mclust.
Missing values are introduced according to the MCAR mechanism under two different missing data patterns. In the first scenario, a single missing value is randomly assigned to a given percentage of the sample observations. By setting this percentage at approximately and , we get, respectively, and observations with a single missing value out of observations. In the second scenario, the percentage of missing values refers to the overall number of measurements. Hence, setting the percentage to approximately and , the total numbers of missing values randomly entered into the data matrix are and , respectively, out of total measurements.
In this real data analysis example, we aim at comparing the clustering performance of the GMM fitted on the original data, as in the simulation studies, against the proposed methods, i.e. MMCEM1 and MMCEM2, the multiple imputation approach, and the GMM fitted on the data obtained after removing the observations with at least one missing value (GMMD). The performance of these methods is evaluated using the ARI. Furthermore, the BIC criterion is employed to select both the number of clusters and the parsimonious within-component covariance matrices.
Figure 9 shows the results for the Diabetes data after repeated application of the missing-data procedure outlined above. In the first missing data scenario, where missing values are randomly assigned to observations so that each row of the data matrix has at most one missing value, the methods perform in a similar way, and they are quite close to the GMM estimated on the full dataset (see left panel of Figure 9). In the second scenario, where the percentage of missing values is distributed over the entire dataset, the proposed MMCEM methods appear to outperform the other methods based on listwise deletion or multiple imputation (see right panel of Figure 9).
6 Conclusions
In this paper we proposed two different algorithms to estimate GMMs in the presence of missing values by exploiting the well-known EM algorithm. Both algorithms employ Monte Carlo methods during the E-step to build augmented datasets via stochastic imputation of the missing values. These are then used in the standard M-step, thus allowing parameter estimates to be obtained for any parsimonious component-covariance structure available for Gaussian mixtures.
In general, the proposed methods seem to outperform the multiple imputation approach, both in terms of clustering and density estimation. The MMCEM1 and MMCEM2 algorithms are basically equivalent when the number of mixture components and the covariance structure are known. When these are unknown and, consequently, are selected by BIC, the MMCEM2 procedure provides more accurate clustering partitions and density estimates.
On the other hand, the proposed algorithms are highly demanding in terms of computational cost. For high-dimensional datasets the procedures may need a large number of iterations during the model estimation phase, because the data augmentation in the imputation step strongly depends on the number of observations and on the MC sample size. For this reason, more efficient methods could be investigated. In addition, since the proposed methods are based on the MAR and MCAR assumptions, another future development might consider the MNAR missing mechanism.