Cluster analysis partitions data into groups of similar observations in an unsupervised manner and commonly without knowledge of the total number of clusters. Clustering applications appear in many fields, including medical imaging , gene expression , microbiome studies , crime analysis , and astronomy . Prominent approaches to cluster analysis can be grouped into centroid-based methods such as -means 7], and model-based methods . Refer to [9, 10, 11, 8, 12, 13] for a comprehensive introduction to the rich topic of clustering.
In practice, real datasets may have missing values or otherwise partially observed records that complicate the validity and application of standard statistical methodology. Missingness may result from diverse causes, with an underlying mechanism of one of three types: missing completely at random (MCAR), missing at random (MAR), or not missing at random (NMAR) 
. Under MCAR, the probability that a case (record, sample, or observation) is missing feature (variable, attribute, dimension) values does not depend on either the observed or missing feature values. The mechanism is MAR when this probability depends on the observed feature values, butnot the missing feature values, and NMAR when the probability depends on both observed and missing feature values. Notably, MCAR data are also MAR; if the data are not MAR, then they are NMAR. Strategies for analysis of data with missing values are often critically dependent on the missingness mechanism, and clustering is no exception.
For clustering problems, the most common (and often expedient) treatment of missing values is deletion, on either a case or feature basis, or imputation [15, 16]. Given a dataset with cases and features, case deletion removes all cases with any missing values across the features, leading to a reduced dataset with cases that are fully observed for all the original features. After a clustering algorithm has been applied to the resulting reduced dataset of complete cases, the remaining incomplete cases can be assigned to the obtained cluster partition, for example, by using a partial distance 
or marginal posterior probability approach. An alternative deletion approach is executed on a feature-wise basis by discarding any features that are not observed for all cases, resulting in a dataset of cases but features, on which a clustering algorithm can be applied to directly cluster all cases . While attractive for their ease of implementation, both data exclusion schemes make an assumption of a MCAR mechanism, violation of which leads to reduced clustering performance. Even if data are MCAR, deletion may yield poor clustering performance due to loss of information. In contrast, imputation approaches [20, 21, 22, 23] for clustering replace each missing value with a predicted value to produce a completed dataset that can be supplied to the desired algorithm to cluster all cases. Critically, this approach treats the imputed values as if they were observed values, and thus ignores any error and uncertainty associated with the fact that they are not the actual values. Obtaining a suitable method for imputation can be difficult because the most appropriate choice likely depends on the unknown cluster partition. As a consequence, use of imputation has been shown to substantially diminish clustering performance .
The drawbacks of deletion and imputation have prompted clustering approaches that incorporate the missing data structure, yet use all the observed entries in the dataset.  and  used a partial distance to measure the distance between differentially observed cases.  extended the fuzzy
-means algorithm for incomplete cases but after imposing soft constraints based on estimating the distance between incomplete cases and cluster centers. In fuzzy clustering,[27, 28, 29] utilized complete cases to define cluster centers and weights and multiply imputed the incomplete cases in the objective function. The -means extension of  also used soft constraints defined by the partially observed features, but this requires at least one feature to be observed across all cases. The -POD algorithm  employs majorization-minimization  to minimize the objective function of -means using partial distances for incomplete cases. Recently,  developed the -means algorithm, which generalized the -means algorithm of 
to include partially-observed records. These approaches all utilized the Euclidean metric for measuring the distance between observations and cluster centers, in the process assuming hyperspherical-shaped clusters and giving up on robustness to outliers.
Model-based clustering using finite mixtures of the multivariate Gaussian [35, 36] or distributions [37, 38] allow for hyperellipsoidal-shaped clusters  through use of the Mahalanobis distance 
and have a long history of successful application. Compared to the Gaussian distribution, thedistribution confers greater resistance to outliers through its wider tails, and is therefore often the de facto choice for model-based clustering. For clustering of incomplete data,  proposed finite mixture modeling using multivariate distributions and designed an expectation-maximization (EM) algorithm  for both estimating mixture model parameters and treatment of the missing values.  extended this work to incorporate the eigen-decomposed covariance structure of . To better fit asymmetrical-shaped clusters, 
developed an approach using skew-distributions while  extended this work with a eigen-decomposed covariance structure for skew--and generalized hyperbolic distributions. Importantly, all of these works included the missing values in the formal incomplete dataset within an EM algorithm. This may be computationally burdensome and lack robustness to the MAR assumption since, in the expectation step (E-step), the observed feature values inform those that are missing. When the data are NMAR, the observed values are not directly informative of those that are missing because their values are related to their own missingness.
In this paper, we (Section II) propose model-based clustering of partially recorded data using finite multivariate- mixture distributions for only the observed values by integrating out the missing values and excluding them from our clustering algorithm calculations. An alternating expectation-conditional maximization (AECM) algorithm  implements our approach and is seen in simulation studies (Section III) to reduce computational complexity when compared to a full EM approach and also to do well under violations of the MAR mechanism. Section IV uses our methodology to cluster gamma-ray bursts data. We conclude with some discussion in Section V. Two appendices have additional details.
Ii-a Background and Preliminaries
We begin by introducing our model and relevant notation for the problem of clustering the dataset consisting of cases and features into clusters, allowing for missing values in . For now, we treat as known, postponing discussion on choosing to later. Assume that the cases are independent and arise from a finite mixture of distributions described by the density where is the proportion represented by the cluster with , and is the -variate -density with mean , positive-definite real dispersion matrix
, and degrees of freedomdefined as
where is the Mahalanobis distance  given by
In missing data problems, the -dimensional records are only partially observed and we leverage the observed values for clustering. WLOG, suppose each
-vector is decomposed into observed and missing components aswhere is the observed component and is the missing component for each observation . Define observed and missing component selection matrices and , respectively, such that extracts the observed component from and has dimension , and , of dimension , extracts the missing component from . Then These two matrices together account for the entire -dimensional vector through the property . Omitting the vacuous case where no features are observed, there are unique patterns of missingness possible for each case. The marginal density  of the observed values in the th observation record is
where and are the mean and dispersion of the observed components of the th record in the cluster.
Our development above uses as a non-stochastic selection matrix to specify the presence of measurement for the th record, and as a computational device. As pointed out by a reviewer, a more accurate development would involve incorporating the stochastic nature of to capture the missingness mechanism if such were available. However, incorporating such a mechanism is application-specific, and so for this paper, we have assumed that the missingness mechanism is unavailable to us, and inherently MCAR.
Ii-B EM Algorithm for Parameter Estimation
Ii-B1 Complete data and log likelihood
A EM algorithm for parameter estimation can be formulated using (2) by specifying a so-called complete dataset and corresponding log likelihood. Since missing values are omitted in (2), the “actual” missing values are not part of the complete dataset; instead we include the “conceptual” missing values of class memberships and characteristic weights that we now introduce. Finite mixture modeling approaches to clustering can be recast as a problem of missing cluster membership labels. To this end, we define the latent class membership indicators
where denotes the indicator function. As an unsupervised learning task, all the ’s are missing. To devise an EM algorithm, we also utilize the multivariate Gaussian-gamma mixture formulation of the multivariate distribution , rewriting (1) as
where is the -variate Gaussian distribution with mean
and variancedenoted by , and
is the gamma distribution with shape and rate parameters both given by, and denoted by
. The random variablecan be understood as a latent characteristic weight. This provides a hierarchical specification of (2) that is defined by
where is the vector of latent class labels and is the characteristic weight for the observation.
Together, we take to be the complete data, disregarding the missing values . The corresponding complete data log likelihood for the parameters is, but for an additive constant, given by
Ii-B2 An AECM algorithm for parameter estimation
We now design an AECM algorithm for maximum likelihood estimation of all the model parameters assuming the complete data log likelihood function in (5). The AECM approach differs from a general EM approach in that it breaks down each iteration by partitioning the parameter space into blocks, and cycles through the partition by alternating between updating each block of parameters through a conditional maximization (CM-step) and an E-step. For fully observed data, our approach reduces to that of  for unconstrained dispersion matrices.
The Q-function, given previous iteration parameter estimates , is where (excluding constant terms) we have
with as the digamma function, as the (current iteration) posterior probability that the th record belongs to cluster , and is the (current iteration) conditional expectation of given and . The later weights the influence of in estimation of and .
In the E-step, given current parameter estimates , we obtain the updates
To define the CM-steps, we form the parameter space partition following . Our computation in the CM-steps makes use of missingness indicator vectors
defined for . In the first CM-step, we update the ’s according to , where the numerator can be understood as representing an estimated (current iteration) sample size from the th group. Also in the first CM-step, for computational simplicity, we can update the s, following derivations in Appendix A, by
if it increases , or we keep the current value (however, see the comment below). Compared to updates for using fully observed data, the missingness indicators in the RHS of (6) only add element-wise contributions for observed values. Similarly, the LHS term serves as an element-wise number-of-observations adjustment for the number of observed values in each feature. The first CM-step then updates the s as the solution to
There is no closed-form solution to (7). Our R package MixtClust offers a numerical solution using Brent’s method  and, by default, also extends the closed-form approximation introduced by  that uses
by modifying to only use the observed number of features , in contrast to , for the observation:
The second CM-step updates (see Appendix A for derivations and explanations), also for computational reasons, suggests updating ’s as
if is increased, or we keep the current value for (see, as before, the comment below). where denotes Hadamard (element-wise) inverse, denotes Hadamard product, and
denotes tensor product. The missingness indicators play similar roles as withs, but here they operate on the elements of a matrix rather than a -vector. Our suggested updates therefore provide for a generalized EM algorithm [42, 52].
The check for the at each of the two CM steps can be quite expensive, so our implementation instead checks for an increase in the loglikelihood at each cycle, terminating the algorithm at the previous iteration if there is a reduction in the total loglikelihood. The calculation of the loglikelihood is needed to be made at each cycle to assess convergence, so does not increase computational cost.
Comparison to full EM
Our approach uses marginalization, therefore excluding consideration of the missing data values in the formulation of the incomplete dataset for EM style algorithms and so does not utilize the distribution of missing values conditional on the observed values, i.e., , which contrasts to the approach of . While their approach uses a different parameter space partition, the primary difference is that their CM-step updates for and (general covariance structure) replaces (6) and (8) by
are updated in the E-step. Our approach only performs computations on the observed values in the dataset, rather than on all values, in the updates for the ’s and ’s in this full EM approach. We also avoid computing (9) and (10) altogether, which in the full EM approach need to be updated between every CM-step. While imputation approaches also avoid evaluating the equations, they perform computations on values. Therefore, our method by design has fewer computations and is faster than all comparative methods that account for missing values. However, our algorithm is also prone to premature termination, with effects that we account for in the next section.
Ii-B3 Initialization and convergence assessment
EM algorithms and their variants such as AECM find solutions in the vicinity of their initialization, with convergence leading to the discovery of a local, but not necessarily global, solution. Consequently, good starting values are important for its good performance.  provided the em-EM algorithm that runs the EM algorithm (“em” or “short EM”) from multiple starting points to lax convergence and then chooses the solution with the highest loglikelihood and runs the EM algorithm to strict convergence. A modification  proposes eschewing the potentially expensive loglikelihood check at each “em” step, running each “em” attempt for a small fixed number of iterations, and then runs the top few ones with highest loglikelihood for the long run and then choosing the one that upon convergence the estimate produces the largest log likelihood. We use this “Modified em-EM” approach  in our AECM algorithm for the full EM computations, with short runs and 10 long runs. However, our marginalization approach does not guarantee increases in the loglikelihood, so we adopt the Rnd-EM algorithm of , that eliminates the “em” steps altogether. We address the case of premature termination by having a large number (here, 10, to allow for fair comparison across methods, but perhaps much more) of long runs. We choose the initial seeds in the same manner as  and follow with one run of the EM cycle. Beyond checking for decreases in the loglikelihood (for the marginalization case), we use a lack-of-progress criterion to assess algorithm convergence, stopping when for a desired small (we use ), where is the observed-data log likelihood resulting from (2) and is the estimate of at the AECM iteration.
Ii-C Determining the number of clusters
The AECM algorithm presented here assumes a known or number of clusters. However, this is rarely the case in applications, and we precede by formulating the choice of as a model selection problem. Commonly, the Bayesian information criterion (BIC)  is used to discriminate between competing models for a given dataset, and is defined by where is the maximized observed-data log likelihood and is the number of free parameters to be estimated. We choose by choosing, from among a range of candidate values, the one attaining the smallest BIC at convergence. This approach of using BIC to determine the number of clusters in finite mixture models is well-established (see, e.g.[57, 58, 8]).
Iii Performance Assessment
This section reports performance evaluations in simulation experiments of our methodology relative to competing methods.
Iii-a Simulation study design
We simulated data with varying clustering complexities determined by the generalized overlap of and maximum eccentricity of  and implemented in the R package MixSim . The generalized overlap , denoted here by , adopted ideas from  to arrive at a one-point measure of clustering complexity, specifically, a numerical summary of the overall overlap between pairs of clusters. Higher values of the generalized overlap correspond to increased cluster overlap and consequently higher clustering complexity. Eccentricity controls the shape of the clusters, and is specified as , where and
correspond to the smallest and largest eigenvalues of the dispersion matrix. Taking values in, a perfect hypersphere has eccentricity
whereas a perfect hyperplane has eccentricity. Our simulations considered two clustering complexities: high () and low (). For each complexity level, we simulated 100 complete datasets with features, clusters, cases.
The degrees of freedom were set to be for all clusters in our simulation experiments.
Given each full synthetic dataset, we deleted values according to the four missingness mechanisms, of MCAR, MAR, NMAR1, and NMAR2, to produce partially recorded datasets for comparing the competing clustering algorithms. For the MCAR setting, we randomly removed values across the entire dataset so that each element had the same probability of being missing for and . Under MAR, we randomly removed values in only the first two features according to MCAR, as in . Our experiments considered two versions of NMAR. The NMAR1 version of  had one cluster being fully observed with the other two clusters having MCAR observations. The NMAR2 version [31, 33]
also had one cluster fully observed, but the other clusters have the appropriate bottom quantile of each feature removed so as to achieve an overall missingness level. (For our experimental evaluations, we ensured that there were at least complete records from each cluster, to allow for methods such as deletion to work.) For the first set of experiments, where our comparisons also included imputation methods, we had an overall proportion of missingness of . For the more thorough, comprehensive evaluations that only considered model-based clustering methods, we used . This translated to between 14-16%, 25-31%, 30-43% and 39-55% of observations in our simulated datasets having at least one incomplete record. Figure 1 represents a sample dataset in the high clustering complexity scenario with the four patterns of missingness, for .
Iii-B Comparison methods and evaluation
From now on, for ease of reference, we label our proposed approach as “Observed EM”, the full EM approach as “Full EM” and the case deletion approach, that only makes use of complete cases, as “Complete Case.” We also consider three imputation schemes to produce completed datasets before applying our software: Amelia II, mi, and mice. The Amelia II approach of  assumes the data are multivariate Gaussian and combines the EM algorithm with bootstrapping to draw from the posterior of the complete data parameters that are then used for imputation. In contrast, the mice and mi methods of  and , respectively, make use of chained equations to impute missing values. For all imputation approaches, we used default settings to generate completed datasets and performed clustering using the average of the imputed values. We evaluated clustering performance by comparing the the true cluster partition to that obtained by each method at the BIC-determined best (from now on denoted by ) via the Adjusted Rand index (ARI) . The (unadjusted) Rand index  is a measure of class agreement taking value in
, with a value of one indicating perfect agreement. Under random classification, the Rand index has an expected value greater than zero, reflecting the fact that, by chance alone, random classification could correctly classify some observations. The ARI is a modification of the Rand index that, in contrast, has expected value of zero under random classification while retaining the property that a value of one corresponds to perfect classification.
Iii-C Simulation results
Iii-C1 Clustering performance
We evaluate our clustering methods in terms of -selection and in terms of the ability of our methods to obtain the true partition. Figure 2 displays a summary of our results. We consider accuracy of BIC in selecting the number of clusters in Figure (a)a. Under high clustering complexity, most algorithms tended to select too few, rather than too many, clusters. However, in general, and except for being edged out by the “Complete Case” in NMAR2 for both clustering complexity situations and MAR for the low clustering complexity scenario, our “Observed EM” algorithm usually deviated the least from the true . All three imputation methods generally selected more than the true , and this was exaggerated under low clustering complexity. Within each pattern of missingness and clustering complexity, our proposed approach (“Observed EM”) provided competitive recovery of the true number of clusters. The cluster partition at was next compared to the true class memberships using the ARI, and is summarized in Figure (b)b. Overall, our approach produced cluster partitions at least as, or more, closely aligned with the truth under low clustering complexity for all missingness mechanisms. On the other hand, for high clustering complexity problems, our method is only the best overall under NMAR2, and for the remaining three missingness mechanisms for which the MAR assumption holds or is not as severely violated, our approach is only surpassed by “Full EM”.
The results of our simulation experiments show good performance of our “Observed EM” procedure. In cases with high clustering complexity, “Full EM” performs better than our case. However, even here, “Observed EM” is quite competitive. We now evaluate both methods against each other and “Complete Case” in a more comprehensive set of experiments.
Iii-C2 Comprehensive comparisons between “Observed EM”, “Full EM” and “Complete Case” methods
Evaluations at the true
Our first set of evaluations assumed that the true was known.
We report performance for the three mehods in Figure 3. The discussion at the end of Section II-B2 demonstrated that the number of calculations needed to conduct one iteration of our “Observed EM” approach is no more (and often much less) than those needed for “Full EM”. However, because the trajectories taken by the two methods can be different, we also evaluated the runtime of each algorithm. Figure (a)a shows that the time taken to termination is almost always smaller for the “Observed EM” case as opposed to the “Full EM” case, and this advantage increases with higher clustering complexity and amounts of missingness. (Interestingly, the “Complete Case” is slower than “Observed EM” despite its computations being often done on substantially smaller datasets: we attribute this finding to the difficulty of estimation in smaller datasets.) Our speed also does not come at a cost to the ability of “Observed EM” to recover the true partition, for Figure (b)b shows that our method is quite competitive, and sometimes even slightly better than with “Full EM”. An interesting question that arises is whether the choice of initializer (Modified em-EM for “Full EM” but Rnd-EM, by necessity, for “Observed EM”) plays a role. Appendix B shows no appreciable difference between the two initialization approaches on the speed and performance of “Full EM”.
Evaluations with unknown
Our next set of evaluations required to be estimated from , for which we used BIC, following Section II-C.
Figure 4 shows relatively good performance of “Observed EM” in estimating .
Indeed, Figure (a)a shows that even here, “Observed EM” is substantially faster than “Full EM” and that, at least in the cases of low clustering complexity, is the best overall performer (Figure (b)b). In the case of high clustering complexity, this trend only holds for the NMAR2 case, and with lower proportions of missingness for the other three missingness mechanisms. For MCAR, MAR, and NMAR1 missingness mechanisms, the performance of “Observed EM” relative to “Full EM” is more mixed with higher proportions of missingness.
A reviewer has asked us for insight into the reasons for the better performance of “Observed EM” over “Full EM” even when assumptions are far from MCAR. We surmise the reasons for this. Indeed, “Full EM” provides guaranteed local maxima of the EM (AECM) algorithm. Our modification does not, suffering, as it does, from possible premature termination. However, the objective log likelihood function, especially in the case of non-MCAR mechanisms, is not a completely accurate representation of the sampling mechanism that generated the data, and therefore, maximizing it may not be the golden bullet that solves the clustering problem. To evaluate if there indeed is evidence in support of this hypothesis, we compared the optimal BIC for “Full EM” and “Observed EM” and the ARI. (The BIC is a surrogate for the goodness of fit under the assumed loglikelihood model and the ARI is an indicator of cluster recovery and performance.) Figure 6 displays the results. We see that often “Full EM” is able to achieve a higher BIC than “Observed EM” but fails to get more credit for cluster recovery (per ARI) and this mismatch increases correspondingly with proportions of missingness in data. Of course, this does not answer the question as to why “Observed EM” does better in terms of cluster recovery. Our hypothesis is that this happens because our CM-step updates for s and
s use method-of-moments estimators and do not condition on the observed data. When the data are NMAR, the observed values are not directly informative of those that are missing because their values are related to their own missingness. We suspect that this aspect in our algorithm is what is allowing for better performance. Finally, finite mixture models are an indirect approach to clustering where parameter estimation and model selection happen first, and then we have cluster assignment as a byproduct from the calculation of posterior probabilities of classification. Nevertheless, our performance evaluations establish “Observed EM” as a faster and competitive alternative to “Full EM”. We now apply it to the problem of characterizing gamma ray bursts.
Iv Discovering the distinct kinds of Gamma Ray Bursts
Gamma-ray bursts (GRBs) are the brightest electromagnetic bursts known to occur in space and emanate from distant galaxies. Since their discovery, several causes of GRBs have been proposed [66, 67, 68, 69] and the existence of multiple sub-types [70, 71, 72, 73] hypothesized. To elucidate the origins of GRBs, it is of interest to determine the number and defining characteristics of these groups. Early work classified GRBs using one or two features, often using only burst duration . It was argued that more variables were needed to fully account for the observed data structure [5, 75], leading to recent interest in clustering GRBs using more features. Subsequent analyses [19, 18, 76, 77] established five groups in the GRB dataset obtained from the most recent Burst and Transient Source Experiment (BATSE) 4Br catalog.
The BATSE 4Br catalog is the most comprehensive database of the duration, intensity, and composition of 1,973 GRBs, but the records are subject to missing values encoded as zeros [19, 18], leading to a total of 1,599 GRBs that are complete cases. There are up to nine features for each GRB, namely , , , , , , , , and , where denotes the time by which % of the flux arrive, denotes the peak fluxes measured in bins of milliseconds, and represents the fluence in the th spectral channel. Due to the extreme right-skewness in the distribution of these variables, we apply the customary base-10 logarithm transformation to all the variables, and for brevity, omit the logarithm in subsequent descriptions.  however incorporated data-driven transformations in their analysis to address the skew. The two duration variables and are observed for all 1973 GRBs. The three peak flux measurements are only missing in one GRB, while , , and are missing values in 29, 12, 6 and 339 GRBs 
. Multivariate analysis of the GRBs has so far largely focused on the 1599 GRBs with complete records[19, 18, 76, 77, 5, 75]. On the other hand,  ignored the peak fluxes and the
features and the 44 GRBs that were missing values for the other features and performed Gaussian mixture-model-based clustering for the 1929 GRBs and came up with three types of GRBs. They also tried to explain the results of in the context of their findings. However, the analysis of  on the complete dataset showed all nine variables to have clustering information.  also used the maximum marginal posterior probability to classify the remaining incomplete cases to the obtained cluster partition: however, this approach assumes that all the clusters are represented in the complete sample, an assumption that may be untenable, especially when the data are not MCAR. Therefore, it is of interest to include GRBs with partial records in our analysis and our development in this paper helps facilitate that investigation.
Since the imputation approaches performed poorly in our simulation assessments, we restricted our attention to the observed, full, and complete case EM style approaches, again with default settings. The “Observed EM” and “Full EM” methods preferred clusters, as per BIC among candidates ranging from , in contrast to previous reports of only two or three clusters obtained using a few features or the five clusters obtained using the complete dataset. Our methodology, applied on the 1599 complete GRB cases, also found five or six groups, though the latter was not conclusive, following , so we decide on five groups.  also got five groups using the teigen software . This is reassuring even though our method differs from that of teigen insofar as it allows for , to vary across groups. We now discuss the results for the 7-groups solutions (and briefly, for the complete case 5-groups scenario).
Table I presents the estimated cluster proportions and degrees of freedom for the BIC-preferred K obtained using the three methods.
The three solutions disagree mildly in the estimated mixing proportions (but “Observed EM” tracks “Full EM” fairly well). The degrees of freedom are also different in a few cases. While the teigen solution in  found (all five s to be 200), our “Complete Case” solution found much fatter-tailed -mixture components. Figure 7 displays the estimated cluster means
and scale parameters of the multivariate -components. We note minor differences in the means and the scale parameters, with the “Observed EM” and “Full EM” cluster parameters being qualitatively the same. For the seven-groups solution, the “Full EM” had slightly higher log likelihood than the others, however, the main astrophysical properties of the solution is similar for both “Observed EM” and “Full EM” groups so we only discuss the “Observed EM” results.
 provided a novel approach to describing the properties of GRBs. This approach, also adopted by [19, 18, 76, 77], uses the average duration (), total fluence (), and spectral hardness () to characterize the GRBs. (Note that these calculations use the GRB features in the original scale.) Using these values, we can classify the seven GRB “Observed EM” groups as ultra-long/very bright/hard, short/faint/intermediate, ultra-long/intermediate/soft, short/intermediate/intermediate, ultra-long/bright/hard, short/faint/soft, and long/faint/soft in terms of their average duration/fluence/hardness. Further analysis of our results is outside the purview of this paper, but our groups are able to characterize GRBs more distinctly compared that in [19, 18, 76, 77] that used only the complete cases.
In this paper, we consider model-based clustering of partially recorded or otherwise incomplete data using all and only the observed values through the use of an observed data model. A corresponding approximate AECM algorithm for clustering of partially recorded data is developed and implemented in the R package MixtClust. When fitting finite mixtures of distributions to incomplete data for the purpose of clustering, integrating over the missing components has several benefits compared to complete case analysis or including the missing components in an EM algorithm: fewer computations are required in each EM iteration and the approach in our experiments appears to offer somewhat greater resistance to severe violations of the MCAR assumption. Based on the simulation experiments of Section III, we conclude that our approach is efficient and robust when compared to the corresponding complete case analysis and full EM based on finite mixture modeling with multivariate distributions. We also use our methodology to characterize GRBs in the BATSE 4Br catalog into seven sub-types with distinct and interpretable astrophysical properties.
Further consideration of the relative strengths and weaknesses between “Full EM” and our “Observed EM” approach is warranted. In particular, we need to get a better understanding for the reason why “Observed EM” performs better than “Full EM” with increasing deviations from MAR assumptions. The “Full EM” approach utilizes information on the observed values to inform the missing values, and we surmise that this information is beneficial for clustering performance when it is not (too) wrong, in contrast with the NMAR2 setting where “Observed EM” does better. However, as noted by the reviewer, we also believe that incorporating the missingness mechanism in the development of the modeling and the inference is a better approach and should be adopted: however, this is application-specific and requires knowledge of the specific mechanism causing the missingness. In the absence of information on missingness, we find that our approach does well, and is also faster, not only in terms of per iteration, but also in terms of the time to convergence. Further, our simulation experiments used the same number of initialization steps and convergence criteria for all methods, without regard to the fact that our “Observed EM” approach is by design faster than “Full EM”, so it would be interesting to compare performance with times set to be the same, and to see if we can recover the lost ground to “Full EM” using more initializations in the cases where we are currently out-performed. Also, while using the distribution for each cluster accommodates outliers, at least relative to a Gaussian distribution, it assumes that the clusters are ellipsoidally symmetric about their centers. Such an assumption may be unrealistic in practice, where clusters could be asymmetric. A natural extension of our work would incorporate skew- distributions for such cases or, alternatively, employ a symmetrizing transformation such as the Box-Cox transformation considered in finite mixture modeling by . Several other lines of improvement are possible for our method. First, we only consider general covariance structure dispersion matrices, but in actuality a simpler structure may be adequate. Accordingly, future work will incorporate a family of eigen-decomposed covariance structures . Second, we use a lack-of-progress criterion to assess convergence but alternative strategies may be better, for example, use of Aitken’s acceleration  to compute an asymptotic estimate of the log likelihood as proposed by . Finally, while we use BIC to select the optimal number of clusters, this does not account for the classification uncertainty in the fitted model as considered by the integrated completed likelihood (ICL) criterion of . Thus, we see that while we have made some contributions to the goal of model-based clustering of partial records, a number of issues remain that merit further attention.
The authors thank the Editor, the Associate Editor and two reviewers whose careful review of an earlier version of this manuscript greatly improved its content. Our thanks also to Somak Dutta and Carlos Llosa-Vite for helpful discussions. R. Maitra acknowledges support from the United States Department of Agriculture (USDA) National Institute of Food and Agriculture (NIFA) Hatch projects IOW03617 and IOW03717. The content of this paper however is solely the responsibility of the authors and does not represent the official views of the NIFA or the USDA.
The GRB dataset used in this application is available at https://github.com/emilygoren/MixtClust.
Appendix A Derivations for the AECM steps
To reduce computational complexity, we may use the method-of-moments update for :
where is invertible if for , and then check to ensure if the function increases with this update, or revert back to the current value if it does not.
For the dispersion updates, we considering each pattern of missingness and use the selection matrices to expand to dimension ,
which does not provide a closed-form solution. We again propose a method-of-moments estimator for . Specifically, we propose the update
and accept it if increases, or revert back to the current solution.
Appendix B Initialization Effect on “Full EM”
The experiments of Section III-C2 demonstrated the good performance of “Observed EM” in terms of speed: the method was also shown to be competitive
Evaluating Rnd-EM versus Modified em-EM initialization methods on the speed and performance of the “Full EM” algorithm. Each split violin has the Modified em-EM on the left (lighter) lobe and the Rnd-EM on the right lobe. The bars denote the three quartiles of each violin.
in terms of clustering performance. Therefore, we performed another set of experiments, at the true that used “Full EM” with the Rnd-EM  and Modified em-EM  evaluations. Both methods used the same number of initializations () and ten long EM runs, but Modified em-EM also used ten “em” steps. Figure 8 displays the results through a split violin-plot  and indicates that there is not much difference in the distribution of either the time to convergence or in cluster recovery ability. We attribute the similar speed of Rnd-EM (with its zero “em” steps) to the fact that it appears to have to go through more iterations in each long EM run.
-  R. Maitra, “Clustering massive datasets with applications to software metrics and tomography,” Technometrics, vol. 43, no. 3, pp. 336–346, 2001.
-  D. M. Witten, “Classification and clustering of sequencing data using a poisson model,” The Annals of Applied Statistics, vol. 5, no. 4, pp. 2493–2518, 12 2011.
-  P. J. Turnbaugh, R. E. Ley, M. Hamady, C. M. Fraser-Liggett, R. Knight, and J. I. Gordon, “The human microbiome project,” Nature, vol. 449, no. 7164, p. 804, 2007.
J. Agarwal, R. Nagpal, and R. Sehgal, “Crime analysis using k-means clustering,”International Journal of Computer Applications, vol. 83, no. 4, 2013.
-  E. D. Feigelson and G. J. Babu, “Statistical Methodology for Large Astronomical Surveys,” in New Horizons from Multi-Wavelength Sky Surveys, ser. IAU Symposium, B. J. McLean, D. A. Golombek, J. J. E. Hayes, and H. E. Payne, Eds., vol. 179, 1998, p. 363.
-  J. MacQueen, “Some methods for classification and analysis of multivariate observations,” Proceedings of the Fifth Berkeley Symposium, vol. 1, pp. 281–297, 1967.
-  J. H. Ward, “Hierarchical grouping to optimize an objective function,” Journal of the American Statistical Association, vol. 58, pp. 236–244, 1963.
-  V. Melnykov and R. Maitra, “Finite mixture models and model-based clustering,” Statistics Surveys, vol. 4, pp. 80–116, 2010.
-  J. A. Hartigan and J. Hartigan, Clustering algorithms. New York: Wiley, 1975, vol. 209.
-  J. R. Kettenring, “The practice of cluster analysis,” Journal of classification, vol. 23, pp. 3–30, 2006.
-  R. Xu and D. C. Wunsch, Clustering. NJ, Hoboken: John Wiley & Sons, 2009.
-  P. D. McNicholas, Mixture model-based classification. Chapman and Hall/CRC, 2016.
C. Bouveyron, G. Celeux, B. T. Murphy, and A. E. Raftery,
Model-Based Clustering and Classification for Data Science: With Applications in R. Cambridge Series in Statistical and Probabilistic Mathematics, 2019.
-  D. B. Rubin, “Inference and missing data,” Biometrika, vol. 63, no. 3, pp. 581–592, 1976.
-  R. J. Little and D. B. Rubin, Statistical analysis with missing data. John Wiley & Sons, 2014, vol. 1.
-  K. L. Wagstaff and V. G. Laidler, “Making the most of missing values: Object clustering with partial data in astronomy,” in Astronomical Data Analysis Software and Systems XIV, vol. 347, 2005, p. 172.
-  R. J. Hathaway and J. C. Bezdek, “Fuzzy c-means clustering of incomplete data,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 31, no. 5, pp. 735–744, Oct 2001.
-  S. Chattopadhyay and R. Maitra, “Multivariate -mixture-model-based cluster analysis of BATSE catalogue establishes importance of all observed parameters, confirms five distinct ellipsoidal sub-populations of gamma-ray bursts,” Monthly Notices of the Royal Astronomical Society, vol. 481, no. 3, pp. 3196–3209, 07 2018.
-  ——, “Gaussian-mixture-model-based cluster analysis finds five kinds of gamma-ray bursts in the BATSE catalogue,” Monthly Notices of the Royal Astronomical Society, vol. 469, no. 3, pp. 3374–3389, 2017.
-  S. Buuren and K. Groothuis-Oudshoorn, “mice: Multivariate imputation by chained equations in R,” Journal of statistical software, vol. 45, no. 3, 2011.
-  A. R. T. Donders, G. J. van der Heijden, T. Stijnen, and K. G. Moons, “Review: a gentle introduction to imputation of missing values,” Journal of clinical epidemiology, vol. 59, no. 10, pp. 1087–1091, 2006.
-  J. Honaker, G. King, and M. Blackwell, “Amelia ii: A program for missing data,” Journal of statistical software, vol. 45, no. 7, pp. 1–47, 2011.
-  L. A. F. Park, J. C. Bezdek, C. Leckie, R. Kotagiri, J. Bailey, and M. Palaniswami, “Visual assessment of clustering tendency for incomplete data,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 12, pp. 3409–3422, Dec 2016.
J. K. Dixon, “Pattern recognition with partly missing data,”IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 10, pp. 617–621, Oct 1979.
-  Q. Zhang and Z. Chen, “A distributed weighted possibilistic c-means algorithm for clustering incomplete big sensor data,” International Journal of Distributed Sensor Networks, vol. 10, no. 5, p. 430814, 2014.
-  M. Sarkar and T.-Y. Leong, “Fuzzy -means clustering with missing values,” in Proceedings of American Medical Informatics Association Annual Symposium (AMIA), 2001, pp. 588–592.
-  K. Simiński, “Clustering with missing values,” Fundamenta informaticae, vol. 123, no. 3, pp. 331–350, 2013.
-  ——, “Rough fuzzy subspace clustering for data with missing values.” Computing & Informatics, vol. 33, no. 1, 2014.
-  ——, “Rough subspace neuro-fuzzy system,” Fuzzy Sets and Systems, vol. 269, pp. 30–46, 2015.
-  K. Wagstaff, “Clustering with missing values: No imputation required,” in Classification, Clustering, and Data Mining Applications, D. Banks, L. House, F. McMorris, P. Arabie, and W. Gaul, Eds. Springer, 2004, pp. 649–658.
-  J. T. Chi, E. C. Chi, and R. G. Baraniuk, “k-pod: A method for k-means clustering of missing data,” The American Statistician, vol. 70, no. 1, pp. 91–99, 2016.
-  K. Lange, MM Optimization Algorithms. SIAM, 2016.
-  A. Lithio and R. Maitra, “An efficient k-means-type algorithm for clustering datasets with incomplete records,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 11, no. 6, pp. 296–311, 2018.
-  J. A. Hartigan and M. A. Wong, “A -means clustering algorithm,” Applied Statistics, vol. 28, pp. 100–108, 1979.
-  J. D. Banfield and A. E. Raftery, “Model-based Gaussian and non-Gaussian clustering,” Biometrics, vol. 49, pp. 803–821, 1993.
-  G. Celeux and G. Govaert, “Gaussian parsimonious clustering models,” Computational Statistics and Data Analysis, vol. 28, pp. 781–93, 1995.
-  G. McLachlan and D. Peel, Robust cluster analysis via mixtures of multivariate t distributions. Berlin, Heidelberg: Springer Berlin Heidelberg, 1998, pp. 658–666.
-  D. Peel and G. McLachlan, “Robust mixture modeling using the t distribution,” Statistics and Computing, vol. 10, p. 339:348, 2000.
-  B. Lindsay, Mixture models: theory, geometry and applications, 1995.
-  P. C. Mahalanobis, “On the generalized distance in statistics.” National Institute of Science of India, 1936.
-  H. Wang, Q. Zhang, B. Luo, and S. Wei, “Robust mixture modeling using multivariate t distribution with missing information,” Pattern Recogn. Lett., vol. 25, pp. 701–710, April 2004.
-  A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood for incomplete data via the EM algorithm (with discussion),” Jounal of the Royal Statistical Society, Series B, vol. 39, pp. 1–38, 1977.
-  T.-I. Lin, “Learning from incomplete data via parameterized t mixture models through eigenvalue decomposition,” Computational Statistics & Data Analysis, vol. 71, pp. 183 – 195, 2014.
-  W.-L. Wang and T.-I. Lin, “Robust model-based clustering via mixtures of skew-t distributions with missing information,” Advances in Data Analysis and Classification, vol. 9, no. 4, pp. 423–445, 2015, cited By 3.
-  Y. Wei, Y. Tang, and P. D. McNicholas, “Mixtures of generalized hyperbolic distributions and mixtures of skew-t distributions for model-based clustering with incomplete data,” Computational Statistics & Data Analysis, vol. 130, pp. 18–41, 2019.
-  X. Meng and D. van Dyk, “The EM algorithm — an old folk-song sung to a fast new tune (with discussion),” Journal of the Royal Statistical Society B, vol. 59, pp. 511–567, 1997.
-  T.-I. Lin, H. J. Ho, and P. S. Shen, “Computationally efficient learning of multivariate t mixture models with missing information,” Computational Statistics, vol. 24, no. 3, pp. 375–392, Aug 2009.
-  E. A. Cornish, “The multivariate t-distribution associated with a set of normal sample deviates,” Australian Journal of Physics, vol. 7, pp. 531–542, 1954.
-  J. L. Andrews, J. R. Wickins, N. M. Boers, and P. D. McNicholas, “teigen: An R package for model-based clustering and classification via the multivariate t distribution,” Journal of Statistical Software, vol. 83, no. 1, pp. 1–32, 2018.
-  J. L. Andrews and P. D. McNicholas, “Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions,” Statistics and Computing, vol. 22, no. 5, pp. 1021–1029, Sep 2012.
-  R. P. Brent, “An algorithm with guaranteed convergence for finding a zero of a function,” The Computer Journal, vol. 14, no. 4, pp. 422–425, 1971.
-  G. McLachlan and T. Krishnan, The EM Algorithm and Extensions, 2nd ed. New York: Wiley, 2008.
-  C. Biernacki, G. Celeux, and G. Govaert, “Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models,” Computational Statistics and Data Analysis, vol. 413, pp. 561–575, 2003.
-  R. Maitra, “On the expectation-maximization algorithm for Rice-Rayleigh mixtures with application to estimating the noise parameter in magnitude MR datasets,” Sankhyā: The Indian Journal of Statistics, Series B, vol. 75, no. 2, p. 293–318, 2013.
-  ——, “Initializing partition-optimization algorithms,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 6, pp. 144–157, 2009.
-  G. Schwarz, “Estimating the dimensions of a model,” Annals of Statistics, vol. 6, pp. 461–464, 1978.
-  J. L. Andrews, P. D. McNicholas, and S. Subedi, “Model-based classification via mixtures of multivariate t distributions,” Computational Statistics and Data Analysis, vol. 55, no. 1, pp. 520 – 529, 2011.
-  C. Fraley and A. E. Raftery, “MCLUST version 3 for R: Normal mixture modeling and model-based clustering,” University of Washington, Department of Statistics, Seattle, WA, Tech. Rep. 504, 2006.
-  R. Maitra and V. Melnykov, “Simulating data to study performance of finite mixture modeling and clustering algorithms,” Journal of Computational and Graphical Statistics, vol. 19, no. 2, pp. 354–376, 2010.
-  V. Melnykov, W.-C. Chen, and R. Maitra, “Mixsim: An r package for simulating data to study performance of clustering algorithms,” Journal of Statistical Software, vol. 51, no. 12, pp. 1–25, 2012.
V. Melnykov and R. Maitra, “CARP: Software for fishing out good clustering
Journal of Machine Learning Research, vol. 12, pp. 69 – 73, 2011.
-  R. Maitra, “A re-defined and generalized percent-overlap-of-activation measure for studies of fMRI reproducibility and its use in identifying outlier activation maps,” Neuroimage, vol. 50, no. 1, pp. 124–135, 2010.
-  Y.-S. Su, A. Gelman, J. Hill, and M. Yajima, “Multiple imputation with diagnostics (mi) in R: Opening windows into the black box,” Journal of Statistical Software, vol. 45, no. 2, pp. 1–31, 2011.
-  L. Hubert and P. Arabie, “Comparing partitions,” Journal of classification, vol. 2, no. 1, pp. 193–218, 1985.
-  W. M. Rand, “Objective criteria for the evaluation of clustering methods,” Journal of the American Statistical Association, vol. 66, pp. 846–850, 1971.
-  T. Chattopadhyay, R. Misra, A. K. Chattopadhyay, and M. Naskar, “Statistical evidence for three classes of gamma-ray bursts,” The Astrophysical Journal, vol. 667, no. 2, p. 1017, 2007.
-  T. Piran, “The physics of gamma-ray bursts,” Rev. Mod. Phys., vol. 76, pp. 1143–1210, Jan 2005.
-  M. Ackermann, M. Ajello, K. Asano, W. Atwood, M. Axelsson, L. Baldini, J. Ballet, G. Barbiellini, M. Baring, D. Bastieri et al., “Fermi-lat observations of the gamma-ray burst grb 130427a,” Science, vol. 343, no. 6166, pp. 42–47, 2014.
-  B. Gendre, G. Stratta, J. Atteia, S. Basa, M. Boër, D. Coward, S. Cutini, V. D’Elia, E. Howell, A. Klotz, and L. Piro, “The ultra-long gamma-ray burst 111209a: the collapse of a blue supergiant?” The Astrophysical Journal, vol. 766, no. 1, p. 30, 2013.
-  A. Shahmoradi and R. J. Nemiroff, “Short versus long gamma-ray bursts: a comprehensive study of energetics and prompt gamma-ray correlations,” Monthly Notices of the Royal Astronomical Society, vol. 451, pp. 126–143, Jul. 2015.
-  E. P. Mazets, S. V. Golenetskii, V. N. Ilinskii, V. N. Panov, R. L. Aptekar, I. A. Gurian, M. P. Proskura, I. A. Sokolov, Z. I. Sokolova, and T. V. Kharitonova, “Catalog of cosmic gamma-ray bursts from the KONUS experiment data. I.” Astrophysics and Space Science, vol. 80, pp. 3–83, Nov. 1981.
-  J. P. Norris, T. L. Cline, U. D. Desai, and B. J. Teegarden, “Frequency of fast, narrow gamma-ray bursts,” Nature, vol. 308, p. 434, Mar. 1984.
-  J.-P. Dezalay, C. Barat, R. Talon, R. Syunyaev, O. Terekhov, and A. Kuznetsov, “Short cosmic events - A subset of classical GRBs?” in American Institute of Physics Conference Series, ser. American Institute of Physics Conference Series, W. S. Paciesas and G. J. Fishman, Eds., vol. 265, 1992, pp. 304–309.
-  C. Kouveliotou, C. A. Meegan, G. J. Fishman, N. P. Bhat, M. S. Briggs, T. M. Koshut, W. S. Paciesas, and G. N. Pendleton, “Identification of two classes of gamma-ray bursts,” The Astrophysical Journal, vol. 413, pp. L101–L104, 1993.
-  S. Mukherjee, E. D. Feigelson, G. Jogesh Babu, F. Murtagh, C. Fraley, and A. Raftery, “Three Types of Gamma-Ray Bursts,” The Astrophysical Journal, vol. 508, pp. 314–327, Nov. 1998.
-  N. Berry and R. Maitra, “TiK-means: Transformation-infused -means clustering for skewed groups,” Statistical Analysis and Data Mining – The ASA Data Science Journal, vol. 12, no. 3, pp. 223–233, 2019.
-  I. A. Almodóvar-Rivera and R. Maitra, “Kernel-estimated nonparametric overlap-based syncytial clustering,” Journal of Machine Learning Research, vol. 21, no. 122, pp. 1–54, 2020.
-  B. G. Tóth, I. I. Rácz, and I. Horváth, “Gaussian-mixture-model-based cluster analysis of gamma-ray bursts in the BATSE catalog,” Monthly Notices of the Royal Astronomical Society, vol. 486, no. 4, pp. 4823–4828, 05 2019. [Online]. Available: https://doi.org/10.1093/mnras/stz1188
R. E. Kass and A. E. Raftery, “Bayes factors,”Journal of the American Statistical Association, vol. 90, pp. 773–795, 1995.
-  K. Lo and R. Gottardo, “Flexible mixture modeling via the multivariate t-distribution with the Box-Cox transformation: an alternative to the skew t-distribution,” Statistics and Computing, vol. 2, no. 1, pp. 33–52, 2012.
-  A. Aitken, “A series formula for the roots of algebraic and transcendental equations,” Proceedings of the Royal Society of Edinburgh, vol. 45, no. 1, pp. 14–22, 1926.
-  D. Böhning, E. Dietz, R. Schaub, P. Schlattmann, and B. Lindsay, “The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family,” Annals of the Institute of Statistical Mathematics, vol. 46(2), pp. 373–388, 1994.
-  C. Biernacki, G. Celeux, and E. M. Gold, “Assessing a mixture model for clustering with the integrated completed likelihood,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 719–725, 2000.
-  K. B. Petersen and M. S. Pedersen, “The matrix cookbook,” 2012. [Online]. Available: http://matrixcookbook.com
-  R. Maitra, “Efficient bandwidth estimation in 2D filtered backprojection reconstruction,” IEEE Transactions on Image Processing, vol. 28, no. 11, pp. 5610–5619, 2019.