1 Introduction
The goal of clustering is to discover partitions that assign observations into meaningful groups. A favorable property of Bayesian modelbased clustering is that it provides versatile posterior uncertainty assessment on both the model parameters and cluster allocation estimates. However, while inference on modelspecific parameters and mixing weights follow standard Bayesian practice, more development on estimating the clustering partition is needed.
An intuitive, yet naive, way to obtain a point estimate of the best partition is to use a maximum a posteriori (MAP) approach which selects the partition (up to label switching) from the MCMC sample that occurs the most frequently. But when the number of observations is large and the generating mixture is complex, the majority of the MCMC clustering samples are likely to be visited, at most, a few times. In this case, the MAP approach will not usually find the best clustering solution.
To better clarify the notion of optimal partitioning, Binder (1978)
introduced a loss function approach. This considers optimal clustering as a
Bayesian action which attempts to minimize the expected loss of the partition. Under Binder’s linear loss function, the problem reduces to searching for a partitionthat produces a binary affinity matrix
nearest to the pairwise posterior similarity matrix with . The idea of using loss functions has inspired researchers to develop more advanced approaches to Bayesian clustering estimation. Fritsch et al. (2009) showed that minimizing Binder’s linear loss is equivalent to maximizing the Rand index. One extension, the adjusted Rand index, corrects for the chance of random agreement in the Rand index (Hubert and Arabie, 1985). More recently, Wade and Ghahramani (2018) use a loss function based upon variation of information (Meilă, 2007) as well as providing a method of assessing the uncertainty in the estimated partition, and Rastelli and Friel (2018) introduce a greedy algorithm to efficiently find the partition that optimizes a large family of loss functions. Other influential works include Quintana and Iglesias (2003) who adapt Bayesian clustering into product partition models. Moreover, Lau and Green (2007)argue that binary integer programming is notoriously slow for Bayesian clustering estimation and provide a heuristic itemswapping algorithm that excels computationally.
We find that there are several places where these current approaches can be improved. The optimization strategy of directly shuffling the clustering labels to find a binary matrix closest to the posterior similarity can be notoriously slow. Furthermore, conventional methods can only provide hard clustering by the nature of direct label manipulation. They possess no capability of accounting for ambiguous observations. Moreover, they tend to favor two extremes which either treat all uncertain points as singleton clusters or simply lump them into a major neighbor cluster (Fritsch et al., 2009; Wade and Ghahramani, 2018).
In this paper, we propose the use of nonnegative matrix factorization (NMF) to identify optimal partitions from Bayesian modelbased clustering models. We find the NMF approaches not only outperform alternative methods on clustering accuracy but can also provide deeper interpretations of the partitioning results. Additionally, the clustering solutions produced by NMF are more compelling since they can carefully balance between the singletonpreferred and dominantpreferred extremes.
Section 2 briefly reviews the critical components of a Bayesian modelbased clustering procedure and the relevant output from MCMC. In section 3, we briefly describe the general theoretic decision framework for clustering estimation and then present the details for three current methods. Section 4 will introduce NMF in the context of clustering estimation and further present the connections between NMF and Binder’s criterion. Simulated experiments in section 5 illustrate the clustering performance of the NMF models and their soft partitioning capabilities. This is followed with examples from the famous galaxy and crabs datasets showing that the NMF partitioning solutions provide an attractive balance between the two clustering extremes for ambiguous data points.
2 Bayesian Modelbased Clustering
2.1 Bayesian Mixture Models
Mixture models have been widely applied as a clustering tool to find meaningful patterns in a dataset. In Bayesian modelbased clustering, observations that are determined to share the same modelspecific parameters are considered to be from the same cluster.
To formally describe Bayesian modelbased clustering, we introduce the notations as follows. Let
be the observed data vector generated by a
component mixture model and be the group labels corresponding to the observations. If observation is from the component of the mixture (i.e., ), we say that ’s modelspecific parameter is and the distribution of has the form . Notice that each element of uniquely corresponds to a cluster. In addition, thecomponent of the mixture has a weight or mixing probability
defined as .There are many approaches to estimating the model parameters, including the partition vector
, in mixture models. In frequentist approaches, such as Expectation Maximization (EM), the membership vector
is treated as a latent vector and estimated as part of the completedata likelihood (Fraley and Raftery, 2002). The Bayesian alternatives usually rely on MCMC sampling to conduct estimation and posterior inference. For example, Richardson and Green (1997) use a reversible jumping MCMC algorithm which simultaneously estimates the allocation vector as well as the number of clusters . Nonparametric Bayesian alternatives, like the Dirichlet process (DP) mixture model (Ferguson, 1973; Neal, 2000), consider the model parameters coming from an infinite mixture of distributions. Only a finite number of parameters will be used to model the observed (finite) data and the number of clusters from such a model is the number of unique modelspecific parameters (Blackwell and MacQueen, 1973; Rasmussen, 1999; Escobar and West, 1995; Fritsch et al., 2009).2.2 MCMC output
A critical relationship describes the sampling process of the allocation vector :
(1) 
where contains all the mixture model parameters except the allocation . For a DP mixture model, Quintana and Iglesias (2003) write the prior distribution for the allocation vector as
(2) 
where is the number of observations in cluster and is the DP concentration parameter. Combined with a proper setting of the mixture model (Dahl, 2005), a Gibbs sampler can be used to obtain the MCMC samples of the partitioning vector, . More general distributions based on DP prior can be found in Ishwaran and James (2001); Pitman and Yor (1997); Lijoi et al. (2007). Additionally, Lau and Green (2007) provide a detailed review of various prior distributions.
A convenient way to summarize the partition information, which is unaffected by label switching and can be used when the number of clusters is not equal for all MCMC samples, is with the pairwise posterior similarity matrix. Intuitively, two data points are more likely to be members of the same cluster when they appear together frequently in the partitions . To quantify this relationship, a pairwise posterior similarity matrix is defined by
(3) 
where and are the cluster assignments of observations and . When the true probabilities are unknown, the posterior similarity can be estimated from the MCMC samples
(4) 
where equals if and otherwise.
3 Point Estimate of Clustering
3.1 Expected Loss
Suppose a loss function that quantifies the cost of estimating a partition with . The posterior expected loss of using partition is
(5) 
where is the posterior PMF of partition . The clustering goal is to find the Bayes optimal partition i.e., the partition which minimizes (5). However, because it is often infeasible to obtain a closedform expression of , the posterior expected loss is often approximated by its sample version:
(6) 
where is the partition from the MCMC sample.
While the approximation above is intuitive, computation can be expensive when is large. An alternative approximation is to move the expectation inside the loss function
(7) 
where is the posterior expected partition. Note that under linear loss functions, the approximation sign in Eq. 7 is a strict equality (Binder, 1978). Moreover, Fritsch et al. (2009) and Wade and Ghahramani (2018) argue that this approximation gives similar solutions under many useful nonlinear loss functions. To practically evaluate the approximation, one only needs to replace the posterior expectation of a label agreement in the loss function by the estimated posterior similarity , or that of disagreement by . We will present several loss functions and illustrate how they estimate the expected loss from Eq. (6) and Eq. (7).
3.2 Current Estimation Approaches
In general, a procedure gives the point estimate of the clustering as
(8) 
where is the search space and is a penalty function. This optimization problem requires three settings: the choice of penalty function , which will correspond to an estimated expected loss function, the specification of the search space , and the optimization algorithm to find the that minimizes .
3.2.1 Binder’s Criterion and Variations
An early work on Bayesian cluster analysis is Binder (1978), which proposes the Binder loss function
(9) 
where and are the loss weights for two different types of misassignment. The first term represents a false negative where and actually belong to the same cluster (i.e., ) but disagrees and the second term represents a false positive when and are from different clusters in reality but the estimate assigns them to the same cluster. When the loss weights are equal , the expected Binder loss becomes
(10) 
where . This leads to the penalty function under Binder’s loss
(11) 
which replaces with the approximation from Eq. (4) As discussed in Sec. 3.1, the linearity of Binder’s loss allows the use of Eq. (7) with equality.
Another popular criterion for measuring the quality of a clustering solution is the expected Rand index (Rand, 1971). However, Fritsch et al. (2009) argue that maximizing Rand index is equivalent to minimizing the Binder’s loss since they have the following relationship
(12) 
Additionally, Dahl’s criterion (Dahl, 2006) uses another form of quadratic loss function
(13) 
which also can be proved equivalent to Binder’s loss (Fritsch et al., 2009). Thus, in this paper, we only use the original Binder’s criterion to represent all the equivalent variations.
3.2.2 PEAR Criterion
The adjusted Rand index adjusts for the expected number of arbitrary chance agreements allowing a more consistent comparison between clusterings with different group sizes (Milligan and Cooper, 1985; Morey and Agresti, 1984; Hubert and Arabie, 1985). By maximizing the posterior expectation of adjusted Rand index (PEAR), Fritsch et al. (2009) developed MaxPEAR, a partitioning method that estimates the optimal clustering from the posterior similarity matrix.
The adjusted Rand index between the candidate partitioning and the true clustering is given by
(14) 
Using the MCMC approximation from Eq. (6), adjusted Rand can be written as
(15) 
When is large, however, excessive evaluations of Eq. (14) in Eq. (15) become computationally expensive. Therefore, we use the alternative approximation implemented in (Fritsch et al., 2009). This uses Eq. (7) to provide a solution based on the estimated posterior similarity matrix ,
(16) 
3.2.3 Variation of Information
Another loss function, based on variation of information (Meilă, 2007), was recently used by Wade and Ghahramani (2018) and Rastelli and Friel (2018) . The VI loss function is formally defined by the following equation
(17) 
where counts the number of observations in cluster under the true clustering while assigned to cluster by the candidate partition . In addition, and . and are the number of clusters in partitions and respectively.
We consider the penalty function introduced in Wade and Ghahramani (2018) which approximates the expected VI loss using using Eq. (7)
This is the lower bound on the expected VI loss, it avoids the computational expense of using all the MCMC sample partitions when is large and also provides accurate solutions Wade and Ghahramani (2018).
3.2.4 Partition Search Space
The solution to Eq. (8) is obtained by scanning through all possible candidate partitions in the search space . Ideally, the search space contains all possible partitions, where is the Bell number for observations. However, this will be impractical when is large. Even the search space for a fixed number of clusters contains the Sterling number of the second kind (Weisstein, 2002)
(18) 
possible partitions. When an exhaustive search of is not feasible, the search space can be reduced or a nonexhaustive search can be performed.
The most straightforward reduction method uses the MCMC clustering samples as the reduced search space. This will provide at most unique candidate partitions (Dahl, 2006)
. Another popular choice uses the dendrogram from applying hierarchical clustering to the pairwise dissimilarity matrix,
. This method provides candidate partitions; one for every unique solution from the dendrogram.Instead of refining the search space, other researchers attempt to resolve the optimization problem with more efficient algorithms. For example, Lau and Green (2007) propose a heuristic itemswapping algorithm aiming at outperforming the naive and slow binary integer programming algorithm. However, Fritsch et al. (2009) claim that the algorithm is still computationally costly for a large number of observations. Another greedy search algorithm proposed by Wade and Ghahramani (2018) aims to find a partitioning that minimizes the posterior expect loss within a neighborhood around the partition at the current iteration. We found this algorithm to also be timeconsuming in our experiments with a moderately sized dataset. Rastelli and Friel (2018) propose a greedy algorithm that, starting from an initial random partition, changes the membership of one observation at a time to the cluster that minimizes the overall loss. Multiple starts are suggested to enlarge the search space.
4 Optimal Partitioning using NMF
Recall from (8) that the objective of clustering is to find the partition that minimizes a specific penalty function. The existing penalty functions (11, 16, 3.2.3) are based on the estimated pairwise similarity matrix and the binary partition matrix created from a candidate partition . In essence, these methods are attempting to estimate the nonbinary with the binary . This observation motivated us to consider nonbinary estimators that could provide better estimates, are less computationally demanding, and provide a way to access the uncertainty in observations with ambiguous cluster assignments.
Therefore, this paper suggests a partitioning approach that uses nonnegative matrix factorization (NMF) to approximate the similarity matrix . Before a formal introduction of the NMF, we list some of its advantages. First, NMF approximates the realvalued matrix using a lowerrank decomposition which is also realvalued. Without being restricted to a binary matrix, the optimal solution from a broader space could be much closer to . Second, research on NMF optimization is mature with many wellstudied algorithms possessing useful properties for clustering. NMF partitioning methods also facilitate soft clustering. Demonstrated by Li and Ding (2006), output from NMF optimization has a natural soft assignment interpretation and in fact, special NMF models are directly designed for probabilistic clustering problems (Ding et al., 2005; Paisley et al., 2014; Shashanka et al., 2008).
4.1 Formulation
Nonnegative matrix factorization (NMF) decomposes a data matrix into lowerrank matrices which can help reveal underlying patterns of the data. Li and Ding (2006) have studied how various NMF constraints can help uncover different types of clustering patterns. The connections between NMF and clustering have been well established and successfully applied to many research fields (HosseiniAsl and Zurada, 2013; Ding et al., 2005; Kuang et al., 2012; Wang and Zhang, 2013; Xu et al., 2003).
Let the posterior similarity matrix be the data matrix to be approximated. In practice this would be the approximation from (4). The basic NFM problem (Lee and Seung, 2001) is finding two lowerrank matrices that solve the optimization
(19) 
where indicates the Frobenius norm, , and . The nonnegativity in NMF indicates that matrix elements must be nonnegative.
This leastsquares type of NMF problem has a natural clustering interpretation for nonnegative data because the columns of can be considered centroids and the columns of are the cluster weights of the observations (Kuang et al., 2012). This provides a close connection with the means algorithm Kim and Park (2008b); Ding et al. (2005). While means restricts to be a binary matrix which implies hard assignment of cluster allocation, the NMF method results in a realvalued positive matrix which contains both hard and soft clustering information. Specifically, the hard cluster allocation of observation can be estimated using the following equation
(20) 
which is similar to MAP method. In other words, the row index of the maximum value of a column represents the cluster membership of the corresponding observation. In addition, the soft label assignment of observation can be obtained by standardizing the columns of
(21) 
where is the column of matrix . To interpret the soft clustering, the vector describes the confidence levels of assigning observation into the clusters. The higher the value of , for example, the more confidence there is in assigning observation into cluster . To further understand the soft clustering nature of NMF, one can follow the claim by Ding et al. (2005) that a strictly orthogonal matrix leads to hardened partitions and a nearorthogonal provides soft clustering which can be interpreted as posterior cluster probabilities.
4.2 Connection to Binder’s Criterion
Although nonnegative matrix factorization and Binder’s criterion for clustering originated in different fields, they share several similarities. Table 1 presents some connections between Binder’s criterion and NMF.
Binder’s Criterion  NMF Partition  
optimization  
approximation  binary  real 
partition  hard  hard or soft 
The first line of Table 1 compares the penalty function of these two approaches. NMF is attempting to minimize the square distance between two similarity matrices and , while Binder’s criterion seeks to minimize the absolute distance. But as shown in Fritsch et al. (2009), under Binder’s criterion with binary , the absolute distance provides the same solution as the squared distance. Regarding the approximation matrix, Binder’s criterion seeks a binary matrix that specifies a hard partition , while NMF approximates with a realvalued matrix that can be decomposed into two nonnegative matrices and . These matrices can be used to obtain a hard partition (20) or soft partition (21).
4.3 NMF Variations and Algorithms
Using our previous notation of penalty function, the NMF model in Eq. (19) can be written in a general format as
(22) 
where is a generic notation for the distance or loss between two matrices. While the baseline NMF model in Eq. (19) uses the squarederror loss (NMFls), another popular choice of
is the generalized KullbackLeibler divergence (NMFkl), which can capture different features
(Brunet et al., 2004). In addition, two other variations of NMF models are considered. NMFns by PascualMontano et al. (2006) introduces a smoothing matrix between and to avoid deteriorating their sparseness. It not only provides quality reconstruction to the original data but also produces more interpretable basis features in matrix . NMFoffset by Badea (2008) adds an intercept term to absorb the common features of the clusters. In this paper, we will include the four NMF models presented above.Another important choice in NMF is the optimization algorithm. While NMF has a variety of extensions, there are two primary types of optimization algorithms. Lee and Seung (1999) propose multiplicative algorithms for squarederror loss (NFMls) and the generalized KullbackLeibler divergence (NFMkl) which guarantee convergence. Another major algorithm family is based upon projected gradient descent proposed by Lin (2007) and extended by Kim and Park (2007, 2008a). For the squarederror NMFls model, the following two equations present the elementwise updating rules for both basis matrix and weight matrix
(23) 
(24) 
Algorithms for the other NMF models can be found in the original papers (Badea, 2008; Brunet et al., 2004; PascualMontano et al., 2006).
As pointed out by Vavasis (2010), while solving the exact NMF problem in Eq. (19) is NPhard, local heuristic search methods can be developed. The basic NMF algorithms are computationally intensive; each iteration of (23) and (24) will cost at most operations, where is the sample size and is the number of components, but will often be much less due to the sparsity in . And while these calculations are performed for multiple initializations and for a range of values, parallel (Li and Ding, 2013) and GPU (MejíaRoa et al., 2015) based implementations are available. There have been many other developments in improving the computation of NMF, especially for largescale problems (Kim and Park, 2011; Wang et al., 2011; Gemulla et al., 2011; Wang et al., 2016).
The rank of the matrix factorization, , also needs to be determined. Popular methods to select the rank include looking for an inflection point in the internal NMF loss function (Hutchins et al., 2008) and cophenetic correlation (Brunet et al., 2004). However, this paper uses the optimization functions (11, 16, 3.2.3) to estimate . Specifically, NMF is run for a range of values and for each a hard partition is made using Eq. (20) and then scored from one of the optimization functions. In this way, we are using NMF as a way to create and explore the search space .
5 Examples
5.1 Synthetic Example
5.1.1 Clustering Performance on Gaussian Mixtures
To study the clustering performance of different partitioning models, Fritsch et al. (2009) and Wade and Ghahramani (2018) used Gaussian mixture data under various settings of separateness and balancedness. However, there are many other setting parameters worthy of attention in Gaussian mixture (Melnykov et al., 2010)
. Without extending the focus into all the possible aspects of a Gaussian mixture models, we include one additional factor,
sphericity, which measures the roundness of a covariance matrix of the Gaussian components.In our experiments, we simulate data from 4 components (clusters) and choose two levels, T/F, for each dimension of the triplet (separateness, balancedness, sphericity) resulting in eight different types of datasets. For example, TTT represents an easytocluster dataset that is well separated, perfectly balanced, and completely spherical (i.e., identity covariance matrix). The configuration FFF represents a more mixed, imbalanced, and nonspherical Gaussian mixture dataset.
Table 2 displays the values of each configuration. Specifically, the separateness is controlled by the distance from the components means (
) to the origin. The second column shows the cluster sizes under the two balancedness settings and the last column explains the structure of the variancecovariance matrices
. Data is simulated 100 times for each configuration of Table 2. Fig. 1 visualizes one realization for each configuration.For each realization, we fit a Dirichlet process Gaussian mixture model as described in Liverani et al. (2015). This models the data as arising from an infinite mixture of Gaussian components. The prior model for the component means are and covariance matrix
. The hyperparameters
, with the range of the covariate, is set to of the inverse of the empirical covariance matrix, and . These are the default and recommended values for the hyperparameters for bivariate data in the corresponding R package PReMiuM.After a burnin of 5,000, we retained 10,000 MCMC samples and created the posterior similarity matrix using Eq. (4). From the posterior similarity matrix the method specific partitions are obtained. The search set, , for the three conventional methods is obtained from the union of the partitions generated from each unique cut of the dendrogram from agglomerative hierarchical clustering using average linkage with the dissimilarity . The MinBinder method chooses the partition that minimizes Eq. (11), MaxPear chooses the partition that minimizes Eq. (16), and MinVI chooses the partition that minimizes Eq. (3.2.3). We also considered choosing the partition that minimized the full loss functions (Eq. (14) and Eq. (17)), but did not find a substantial difference in performance and thus did not include the results. We also compared the related method of Medvedovic (Medv) (Medvedovic and Sivaganesan, 2002; Medvedovic et al., 2004) which uses complete linkage and the partition that results from cutting the dendrogram at a value close to one (we used ).
We considered four NMF models: NMFkl, NMFls, NMFns, NMFoffset. For each value of , the four models were fit ten times using different random initializations (all methods shared the same initializations) using the algorithms of Gaujoux and Seoighe (2010). We retained the final solution that minimized the internal NMF loss (e.g., squared error). The optimal was then chosen by making hard partitions according to Eq. (20) and using the partition that minimized one of the expected loss functions.
To examine how well the methods can identify the true groups, we evaluated the partitions against the three metrics: Rand index (12), adjusted Rand (AR) index (14), and the variation of information (VI) (17). The results are summarized in Table 3 which is arranged similarly to the layout of Fig. 1. Corresponding to each data panel, there is a 7by3 table presenting the model performances under that particular setting. In each column under the same setting, we underline the model with the best performance. Notice that the VI criterion is to be minimized while the Rand and AR are to be maximized.
The first noteworthy result is that NMF estimation methods are consistently better than the three conventional approaches when the data are more ambiguous (i.e., efgh). In other words, when the data are more mixed, the NMFbased methods can better identify the actual clustering structure. Even with the exceptions under setting eFTF, NMFns and NMFoffset still provide solutions of comparable performance. This phenomenon pertains to the fact that traditional methods tend to group uncertain points into individual clusters, which notoriously overidentifies the number of clusters.
Not surprisingly, all the estimation methods perform well on the separated, perfectly balanced, and nicely rounded data setting aTTT. However, NMFoffset method actually outperforms other candidates under the VI criterion, which is observed in setting bTTF as well. Another observation seen by comparing ab as well as gh is that changing sphericity only will result in the same optimal estimation approaches. When the simulated data are well separated and unbalanced (cd), NMFkl is the best procedure among the NFM methods and is competitive with MinVI.
Fig. 2 provides a summary of the selected number of clusters for all the 100 runs. Recall that the data were simulated with clusters. To keep perspective in the plot, is truncated at 8. As the cluster structure becomes more ambiguous (efgh), MinBinder and MaxPEAR tend to inflate the estimated cluster size. The MinVI and Medv approaches move in the opposite direction and tend to underestimate the number of clusters, especially for settings eFTT and gFFT. The NMFls and NMFks provide more overall consistent estimates of the number of clusters. The NMFns does a good job of selecting the number of clusters when the clusters are well separated but suggests too many clusters when the clusters are more overlapping. The NMFoffset method does not, in general, provide a consistent estimate of the number of clusters.
5.1.2 Soft Clustering
By providing a realvalued matrix , NMF can induce both hard partitioning labels and the soft (probabilistic) cluster assignments. To illustrate soft clustering, we simulated a dataset comprising four Gaussian components with two smaller clusters. The original data clusters are showed in the left panel of Fig. 3. On the right side, we also present the hard partitioning result provided by the conventional methods (i.e., MinBinder, MaxPEAR, and MinVI). Since they reach the same solution, only one picture is needed here.
All the conventional methods choose which underestimates the truth () by assigning observations 3136, that form the minor clusters, to one of the larger groups. They assign observations 3133 to the right cluster and 3436 to the left cluster indicating they prefer creating two elongated (nonspherical) clusters. While these observations do not correspond closely to the other observations in their assigned cluster, there is no way to determine this from the conventional methods.
To illustrate the effectiveness of approximating using NMF, we present the reconstructed posterior similarity matrix in Fig. 4 using the most generic NMF model (NMFls). As clearly depicted in (4b) and (4d) the NMF solution provides a better approximation to than that of the binary constrained conventional approaches showed by (4c). The NMF solution accurately detects and restores the ambiguity around the observations from 31 to 36. In fact, its similarity heatmap is almost identical to that of in (4b).
In Fig. 5, we further investigate NMF’s capability to identify uncertain clusters. The top panel of Fig. 5 shows the expected Binder’s loss (Eq. (11)) under a sequence of . While minimizes the Binder penalty function, the loss is not much greater with and indicating that good solutions have between 2 and 4 clusters.
The middle and bottom panels of Fig. 5 show the standardized rows (Eq. (21)) of the matrices corresponding to the and solutions respectively.
When and hard clustering is preferred, the maximum row value corresponding to each observation will determine the cluster membership obeying Eq. (20). For example, observations 3133 who have larger values on the redboxed curve will be assigned into the red cluster. Therefore, while the hard partitioning from NMF model coincides with the ones given by conventional models, the NMF model is able to reveal the notable ambiguity on membership assignments, especially for observation 31.
In addition when , NMF uncovers the true clustering structure where observations 3133 and 3436 are grouped as minor clusters. It can also be observed that the bluedotted and purplecrossed curves have similar profiles indicating that observations 115 and 3436 have strong similarities even though they would be assigned to different clusters. This type of softclustering information that is present in NMF provides a deeper understanding of the uncertainty that exists in the clustering solutions.
5.2 Galaxy Data
A popular dataset for research in Bayesian modelbased clustering is the galaxy data (Roeder, 1990) which records velocities of 82 galaxies from the Corona Borealis region. While the original work aims at developing nonparametric density estimation for the onedimension galaxy speed, latter research finds this dataset compelling for modelbased clustering (Escobar and West, 1995; Lau and Green, 2007; Wang and Dunson, 2011).
In this paper, we use the MCMC samples of clusterings from a DP mixture fitted to the galaxy velocities (in 1,000 km/sec) provided by the R package mcclust.ext (Wade and Ghahramani, 2018), which also contains the predicted density using the fitted DP mixture model. The MCMC sampling provides 10,000 draws of the clustering labels with 9,636 unique partitions (up to label switching).
Fig. 6 shows the estimated clustering solutions for each model. Within the traditional methods, MinBinder and MaxPEAR produce the same solution which has a relatively large by treating observations with high uncertainty (i.e., observations 8, 9, 78, and 79) as singletons. The Medvedovic and NMFns methods move in the other direction and suggest there are only two clusters. Interestingly, they both indicate there is a component with large variance to account for the observations in the tails. Specifically, Medvedovic indicates observations
are in the outlier component while NMFns also includes observations 8 and 9.
The remaining approaches select
, but differ in their cluster assignments. The MinVI approach assigns the ambiguous observations to the dominant cluster in the middle. This suggests that each component could plausibly be modeled with a rather symmetric distribution. The NMFkl model only differs in that it assigns observation 79 into the right cluster suggesting the presence of a skewed component with a heavy tail. The NMFls also includes observations 77 and 78 into the right component. NMFoffset is the most extreme, putting the first and last 9 components into the left and right components, respectively, suggest two highly skewed components.
To get a better sense of the uncertainly present in the clustering, Fig. 7 displays the softclustering estimates using NMFkl. For , this analysis reveals that the cluster assignments for observations and are highly uncertain; something not seen from the hard classifications. It also shows that observation 77 has an almost 20% chance of being part of cluster 3. The results for indicate that clusters 2 and 3 have substantial overlap but clearly prefers to include observations into cluster 3. When , the model is more confident about observations but less confident about observations .
5.3 Crabs Data
Leptograpsus crabs dataset (Campbell and Mahon, 1974) has 200 crabs consisting of two species (i.e., orange and blue) each with 50 females and 50 males. This gives 4 evenly sized clusters of 50 crabs of each combination of sex and species. Five morphological measurements are collected for each crab including front lobe size (FL), rear width (RW), carapace length (CL), carapace width (CW), and body depth (BD) all in mm. A common, yet interesting, problem on the crabs dataset is that the number of clusters tends to be easily overestimated using all the five variables (Raftery and Dean, 2006). To overcome this problem using Gaussian mixture models, Raftery and Dean (2006) and Maugis et al. (2009) propose different model selection strategies to reduce variable redundancy.
Fig. 8 shows the posterior similarity matrix constructed from the MCMC sampling (using the same settings described in Section 5.1.1) Table 4 shows the performance under the three clustering quality measures.
While the posterior similarity matrix reveals some uncertainty in which observations should be grouped together, all methods choose a correct number of clusters (
) except MinBinder. The clear winner in this example is the NMFoffset method, which correctly classifies the most crabs and has the best Rand, AR, and negative VI scores.
NMFls  NMFkl  NMFns  NMFoffset  MinBinder  MaxPear  MinVI  Medv  
Rand  0.912  0.915  0.912  0.924  0.917  0.915  0.915  0.912 
AR  0.765  0.774  0.765  0.799  0.779  0.774  0.774  0.765 
VI  0.762  0.744  0.762  0.671  0.711  0.744  0.744  0.762 
5.4 NMF Scalability
Random pairwise posterior similarity matrices were created for . The four NMF models were run using and the computing time is provided in Fig. 9. These timings are for 10 random initializations using only one core with the R package NMF. While this does show the scaling of NMF in a basic implementation, these times represent a rather worst case scenario as convergence will be slow (since there is no actual structure in the data) and no parallel implementation was used. While the NMF R package does facilitate parallel computation for each initialization (details given in Gaujoux (2018)), it does not include algorithms to exploit the sparseness of the posterior similarity matrices.
6 Conclusions
This paper addresses the problem of estimating the optimal partition from Bayesian cluster models. Particularly, a nonnegative matrix factorization (NMF) framework has been proposed that uses the pairwise posterior similarity matrix constructed from the MCMC clustering samples to obtain the most appropriate clustering estimates under various penalty functions. Instead of a direct optimization using label manipulation, the NMF approach implicitly induces the partition from the weight matrix supplied by the flexible lowrank approximation . Another favorable property of the NMF models is a soft/probabilistic interpretation of the clustering label assignment which can be helpful in understanding the uncertainty in the partition.
In our experiments, we found several practical advantages of the NMF partitioning models. Their clustering capability outperformed existing approaches under various Gaussian mixture settings and performance measurements. Additionally, NMF was able to recognize ambiguous observations located around the boundaries of major clusters. Even under a misspecified number of clusters, NMF could still identify the uncertain data points, which is something not available with hardclustering methods. The galaxy data application demonstrates that NMF can provide balanced solutions between the two extremes, favored by the conventional models, of singletonpreferred (MinBinder and MaxPEAR) and dominantpreferred (MinVI). The NMF approach can also help reveal the uncertainty in the label assignments from an analysis of softclustering probability estimates. The analysis of the crabs data further illustrates the advantages of the NMF solutions. By providing a larger search space, the NMF approach has a better chance of identifying the optimal hard partition. This property is especially appealing when there is much uncertainly in the posterior similarity matrix.
We have not exhausted the options available with NMF. There are many other divergence metrics (Li and Ding, 2006), more rigorous probabilistic soft clustering approaches (Zhao et al., 2015), and additional models utilizing structural constraints. With the symmetric similarity matrix as an input, for example, He et al. (2011) propose a symmetric NMF model with the approximation to explicitly take advantage of the symmetry of .
The use of NMF for clustering from the posterior pairwise similarity matrix opens up the potential to use spectral clustering in the mixture model framework.
Ding et al. (2005), for example, show how spectral clustering of a graph Laplacian can be written as an NMF problem. Building on this, if we consider the pairwise similarity matrix in (4) to represent a graph weight matrix then methods from graph clustering, commonly called community detection, can also be used (Fortunato, 2010). This represents an exciting future direction.The NMF approach requires optimizing over the rank of the basis matrix . This is done by running NMF for a set of values, each with multiple initializations. While this leads to an increase in computation it also provides an expanded search space which gives a better chance of finding the globally optimal partition. The two recent methods of Wade and Ghahramani (2018) and Rastelli and Friel (2018) suggest greedy algorithms to judiciously explore the search space by considering a potentially different number of components at each step.
This paper did not substantially deal with the important issue of choosing the appropriate loss function, but rather focused on demonstrating that NMF can perform well under several common loss functions. However we did find evidence in the simulation study and examples, similar to (Fritsch et al., 2009; Wade and Ghahramani, 2018; Rastelli and Friel, 2018), that the Rand (MinBinder) and Adjusted Rand (MaxPEAR) losses tended to create singleton clusters that overestimated the number of components as well as evidence, like Rastelli and Friel (2018), that the VI (MinVI) loss can slightly underestimate the number of components in the most complex settings. Regardless of the loss function used, however, we found that an NMF method could find the correct number of components more often than the conventional approaches.
This paper develops NMF for the postprocessing step of estimating the optimal partition from the pairwise posterior similarity matrix; we pay little attention to the choice of component model and hyperparameters that produce the similarity matrix. A promising future direction is the exploration of how the choice of component (e.g., symmetric, skewed, or heavytailed kernels) and hyperparameters interact with the aggregation used to estimate the similarity matrix to impact the optimal partitions.
The softclustering interpretation of the matrix (21) provides a measurement of the uncertainty in the component membership of each observation individually. The joint uncertainty in cluster memberships could be obtained with the credible ball method of Wade and Ghahramani (2018) which requires a best hard partition and the MCMC sample partitions.
References
 Badea (2008) Badea, L., 2008. Extracting gene expression profiles common to colon and pancreatic adenocarcinoma using simultaneous nonnegative matrix factorization., in: Pacific Symposium on Biocomputing, pp. 279–290.
 Binder (1978) Binder, D.A., 1978. Bayesian cluster analysis. Biometrika 65, 31–38.
 Blackwell and MacQueen (1973) Blackwell, D., MacQueen, J.B., 1973. Ferguson distributions via pólya urn schemes. The annals of statistics 1, 353–355.
 Brunet et al. (2004) Brunet, J.P., Tamayo, P., Golub, T.R., Mesirov, J.P., 2004. Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the national academy of sciences 101, 4164–4169.
 Campbell and Mahon (1974) Campbell, N., Mahon, R., 1974. A multivariate study of variation in two species of rock crab of the genus leptograpsus. Australian Journal of Zoology 22, 417–425.
 Dahl (2005) Dahl, D.B., 2005. Sequentiallyallocated mergesplit sampler for conjugate and nonconjugate Dirichlet process mixture models. Technical Report. Texas A&M.
 Dahl (2006) Dahl, D.B., 2006. Bayesian Inference for Gene Expression and Proteomics. Cambridge University Press. chapter Modelbased clustering for expression data via a Dirichlet process mixture model. pp. 201–218.
 Ding et al. (2005) Ding, C.H., He, X., Simon, H.D., 2005. On the equivalence of nonnegative matrix factorization and spectral clustering., in: SDM, SIAM. pp. 606–610.
 Escobar and West (1995) Escobar, M.D., West, M., 1995. Bayesian density estimation and inference using mixtures. Journal of the american statistical association 90, 577–588.
 Ferguson (1973) Ferguson, T.S., 1973. A bayesian analysis of some nonparametric problems. The annals of statistics 1, 209–230.
 Fortunato (2010) Fortunato, S., 2010. Community detection in graphs. Physics reports 486, 75–174.
 Fraley and Raftery (2002) Fraley, C., Raftery, A.E., 2002. Modelbased clustering, discriminant analysis, and density estimation. Journal of the American statistical Association 97, 611–631.
 Fritsch et al. (2009) Fritsch, A., Ickstadt, K., et al., 2009. Improved criteria for clustering based on the posterior similarity matrix. Bayesian analysis 4, 367–391.
 Gaujoux (2018) Gaujoux, R., 2018. An introduction to NMF package. URL: https://cran.rproject.org/package=NMF. R package version 0.20.6.
 Gaujoux and Seoighe (2010) Gaujoux, R., Seoighe, C., 2010. A flexible R package for nonnegative matrix factorization. BMC Bioinformatics 11, 367. URL: http://www.biomedcentral.com/14712105/11/367, doi:10.1186/1471210511367.

Gemulla et al. (2011)
Gemulla, R., Nijkamp, E.,
Haas, P.J., Sismanis, Y.,
2011.
Largescale matrix factorization with distributed stochastic gradient descent, in: Proceedings of the 17
^{th} ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA. pp. 69–77. doi:10.1145/2020408.2020426. 
He et al. (2011)
He, Z., Xie, S., Zdunek,
R., Zhou, G., Cichocki, A.,
2011.
Symmetric nonnegative matrix factorization:
Algorithms and applications to probabilistic clustering.
IEEE Transactions on Neural Networks 22, 2117–2131.
 HosseiniAsl and Zurada (2013) HosseiniAsl, E., Zurada, J.M., 2013. Artificial Intelligence and Soft Computing: 13th International Conference, ICAISC 2014. Springer International Publishing. chapter Nonnegative Matrix Factorization for Document Clustering: A Survey. pp. 726–737.
 Hubert and Arabie (1985) Hubert, L., Arabie, P., 1985. Comparing partitions. Journal of classification 2, 193–218.
 Hutchins et al. (2008) Hutchins, L.N., Murphy, S.M., Singh, P., Graber, J.H., 2008. Positiondependent motif characterization using nonnegative matrix factorization. Bioinformatics 24, 2684–2690. doi:10.1093/bioinformatics/btn526.
 Ishwaran and James (2001) Ishwaran, H., James, L.F., 2001. Gibbs sampling methods for stickbreaking priors. Journal of the American Statistical Association 96, 161–173.
 Kim and Park (2007) Kim, H., Park, H., 2007. Sparse nonnegative matrix factorizations via alternating nonnegativityconstrained least squares for microarray data analysis. Bioinformatics 23, 1495–1502.
 Kim and Park (2008a) Kim, H., Park, H., 2008a. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM journal on matrix analysis and applications 30, 713–730.
 Kim and Park (2008b) Kim, J., Park, H., 2008b. Sparse nonnegative matrix factorization for clustering. Technical Report GTCSE0801. Georgia Institute of Technology.
 Kim and Park (2011) Kim, J., Park, H., 2011. Fast nonnegative matrix factorization: An activesetlike method and comparisons. SIAM Journal on Scientific Computing 33, 3261–3281. doi:10.1137/110821172.
 Kuang et al. (2012) Kuang, D., Ding, C., Park, H., 2012. Symmetric nonnegative matrix factorization for graph clustering, Society for Industrial and Applied Mathematics, Philadelphia. pp. 106–117.
 Lau and Green (2007) Lau, J.W., Green, P.J., 2007. Bayesian modelbased clustering procedures. Journal of Computational and Graphical Statistics 16, 526–558.
 Lee and Seung (1999) Lee, D.D., Seung, H.S., 1999. Learning the parts of objects by nonnegative matrix factorization. Nature 401, 788–791.
 Lee and Seung (2001) Lee, D.D., Seung, H.S., 2001. Algorithms for nonnegative matrix factorization, in: Advances in neural information processing systems, pp. 556–562.
 Li and Ding (2006) Li, T., Ding, C., 2006. The relationships among various nonnegative matrix factorization methods for clustering, in: Sixth International Conference on Data Mining (ICDM’06), IEEE. pp. 362–371.
 Li and Ding (2013) Li, T., Ding, C., 2013. Nonnegative matrix factorizations for clustering: A survey, in: Aggarwal, C.C., Reddy, C.K. (Eds.), Data Clustering: Algorithms and Applications. Chapman & Hall/CRC.
 Lijoi et al. (2007) Lijoi, A., Mena, R.H., Prünster, I., 2007. Controlling the reinforcement in Bayesian nonparametric mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69, 715–740.
 Lin (2007) Lin, C.J., 2007. Projected gradient methods for nonnegative matrix factorization. Neural computation 19, 2756–2779.
 Liverani et al. (2015) Liverani, S., Hastie, D.I., Azizi, L., Papathomas, M., Richardson, S., 2015. PReMiuM: An R package for profile regression mixture models using dirichlet processes. Journal of Statistical Software 64, 1–30. URL: http://www.jstatsoft.org/v64/i07/.
 Maugis et al. (2009) Maugis, C., Celeux, G., MartinMagniette, M.L., 2009. Variable selection for clustering with Gaussian mixture models. Biometrics 65, 701–709.
 Medvedovic and Sivaganesan (2002) Medvedovic, M., Sivaganesan, S., 2002. Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics 18, 1194–1206.
 Medvedovic et al. (2004) Medvedovic, M., Yeung, K.Y., Bumgarner, R.E., 2004. Bayesian mixture model based clustering of replicated microarray data. Bioinformatics 20, 1222–1232.

Meilă (2007)
Meilă, M., 2007.
Comparing clusterings–an information based
distance.
Journal of Multivariate Analysis 98, 873–895.
 MejíaRoa et al. (2015) MejíaRoa, E., TabasMadrid, D., Setoain, J., García, C., Tirado, F., PascualMontano, A., 2015. Nmfmgpu: nonnegative matrix factorization on multigpu systems. BMC Bioinformatics 16, 43. URL: https://doi.org/10.1186/s1285901504854, doi:10.1186/s1285901504854.
 Melnykov et al. (2010) Melnykov, V., Maitra, R., et al., 2010. Finite mixture models and modelbased clustering. Statistics Surveys 4, 80–116.
 Milligan and Cooper (1985) Milligan, G.W., Cooper, M.C., 1985. An examination of procedures for determining the number of clusters in a data set. Psychometrika 50, 159–179.
 Morey and Agresti (1984) Morey, L.C., Agresti, A., 1984. The measurement of classification agreement: An adjustment to the rand statistic for chance agreement. Educational and Psychological Measurement 44, 33–37.
 Neal (2000) Neal, R.M., 2000. Markov chain sampling methods for dirichlet process mixture models. Journal of Computational and Graphical statistics 9, 249–265.
 Paisley et al. (2014) Paisley, J., Blei, D., Jordan, M.I., 2014. Bayesian nonnegative matrix factorization with stochastic variational inference. Handbook of Mixed Membership Models and Their Applications. Chapman and Hall/CRC.
 PascualMontano et al. (2006) PascualMontano, A., Carazo, J.M., Kochi, K., Lehmann, D., PascualMarqui, R.D., 2006. Nonsmooth nonnegative matrix factorization (nsnmf). IEEE transactions on pattern analysis and machine intelligence 28, 403–415.
 Pitman and Yor (1997) Pitman, J., Yor, M., 1997. The twoparameter poissondirichlet distribution derived from a stable subordinator. The Annals of Probability 25, 855–900.
 Quintana and Iglesias (2003) Quintana, F.A., Iglesias, P.L., 2003. Bayesian clustering and product partition models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65, 557–574.
 Raftery and Dean (2006) Raftery, A.E., Dean, N., 2006. Variable selection for modelbased clustering. Journal of the American Statistical Association 101, 168–178.
 Rand (1971) Rand, W.M., 1971. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association 66, 846–850.
 Rasmussen (1999) Rasmussen, C.E., 1999. The infinite gaussian mixture model., in: NIPS, pp. 554–560.
 Rastelli and Friel (2018) Rastelli, R., Friel, N., 2018. Optimal bayesian estimators for latent variable cluster models. Statistics and Computing accepted. URL: https://doi.org/10.1007/s112220179786y, doi:10.1007/s112220179786y.
 Richardson and Green (1997) Richardson, S., Green, P.J., 1997. On bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: series B (statistical methodology) 59, 731–792.
 Roeder (1990) Roeder, K., 1990. Density estimation with confidence sets exemplified by superclusters and voids in the galaxies. Journal of the American Statistical Association 85, 617–624.
 Shashanka et al. (2008) Shashanka, M., Raj, B., Smaragdis, P., 2008. Probabilistic latent variable models as nonnegative factorizations. Computational intselligence and Neuroscience 2008, 8.
 Vavasis (2010) Vavasis, S.A., 2010. On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization 20, 1364–1377. URL: https://doi.org/10.1137/070709967, doi:10.1137/070709967, arXiv:https://doi.org/10.1137/070709967.
 Wade and Ghahramani (2018) Wade, S., Ghahramani, Z., 2018. Bayesian cluster analysis: Point estimation and credible balls. Bayesian Analysis 13.
 Wang et al. (2016) Wang, D., Nie, F., Huang, H., 2016. Fast robust nonnegative matrix factorization for largescale human action data clustering, in: Proceedings of the TwentyFifth International Joint Conference on Artificial Intelligence, AAAI Press. pp. 21042110. URL: http://dl.acm.org/citation.cfm?id=3060832.3060915.
 Wang et al. (2011) Wang, H., Nie, F., Huang, H., Ding, C., 2011. Nonnegative matrix trifactorization based highorder coclustering and its fast implementation, in: 2011 IEEE 11th International Conference on Data Mining, pp. 774783. doi:10.1109/ICDM.2011.109.
 Wang and Dunson (2011) Wang, L., Dunson, D.B., 2011. Fast bayesian inference in dirichlet process mixture models. Journal of Computational and Graphical Statistics 20, 196216.
 Wang and Zhang (2013) Wang, Y.X., Zhang, Y.J., 2013. Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on Knowledge and Data Engineering 25, 13361353.
 Weisstein (2002) Weisstein, E.W., 2002. Stirling number of the second kind. URL: http://mathworld.wolfram.com/StirlingNumberoftheSecondKind.html.
 Xu et al. (2003) Xu, W., Liu, X., Gong, Y., 2003. Document clustering based on nonnegative matrix factorization, in: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, ACM. pp. 267273.
 Zhao et al. (2015) Zhao, H., Poupart, P., Zhang, Y., Lysy, M., 2015. Sof: Softcluster matrix factorization for probabilistic clustering., in: AAAI, pp. 31883195.
Comments
There are no comments yet.