1 Introduction
Clustering, a fundamental method for the analysis of univariate and multivariate data, is used in various applications such as data mining, vector quantization, and pattern recognition Jain (2010); Du (2010); Saxena et al. (2017). The selection of a suitable number of clusters $K$ is one of the most difficult problems in clustering. In most situations in which an application needs to perform clustering, the true number of clusters is unknown. Selecting $K$ too large or too small can lead to a misleading result.
Many methods that have been proposed to solve the problem of selecting $K$ consider it to be a problem of model selection. Several approaches to model selection have been proposed, and most of these methods belong to one of two categories. The first uses the likelihood with a term penalizing the order of a model. The other entails simulation using Markov chain Monte Carlo (MCMC) methods.
In the method using the penalized likelihood, an optimal $K$ is selected among candidate models obtained by performing clustering with different $K$, using a predetermined form of the penalized likelihood. Several penalized likelihood forms have been proposed from various perspectives. Among these, the Akaike information criterion (AIC) Akaike (1974), derived from the minimization of the Kullback–Leibler (KL) divergence Kullback and Leibler (1951) between the true model and the estimated one, and the Bayesian information criterion (BIC) Schwarz and others (1978), which is derived from the Bayesian point of view, are widely used. The minimum description length (MDL) Rissanen (1978) and minimum message length (MML) Wallace and Boulton (1968); Wallace and Freeman (1987), criteria derived from the viewpoint of coding theory, are also popular. Another form, the integrated classification likelihood (ICL) Biernacki et al. (2000), an improvement of BIC for the clustering task, is also often used.

On the other hand, Bayesian inference methods have also been proposed for model selection. The method of Richardson and Green Richardson and Green (1997) estimates the posterior probability distribution of $K$ using reversible jump Markov chain Monte Carlo (RJMCMC) Green (1995), and is mainly used for density estimation with a Gaussian mixture model (GMM).
The method of using a model selection criterion to choose an optimal model among candidates is simple, but when a candidate model is fitted, it is necessary to carefully determine the initial values of the model parameters. For example, the Expectation–Maximization (EM) algorithm Dempster et al. (1977) requires the problem of initial value dependence to be considered. Although the latter approach based on Bayesian inference can provide abundant information with regard to $K$, the MCMC-based method is computationally expensive.

On the other hand, in terms of methods with low initial value dependency and computational efficiency for selecting $K$, approaches using a greedy algorithm have been proposed. Many of these algorithms search for $K$ by splitting each cluster. The decision as to whether to split a cluster is made using a predetermined split decision criterion. X-means Pelleg et al. (2000), a representative method of this approach, uses the $k$-means algorithm with BIC as the split decision criterion. Some studies improving X-means have been reported Hamerly and Elkan (2004); Feng and Hamerly (2007); Kalogeratos and Likas (2012). G-means Hamerly and Elkan (2004)
uses a statistical hypothesis test as a split decision criterion: it tests whether the data in a cluster follow a Gaussian distribution, and upon rejection, the cluster is split into two. PG-means Feng and Hamerly (2007) is an extension of G-means that assumes the dataset has been generated from a GMM and constructs a test to determine whether a cluster has a Gaussian distribution. Dip-means Kalogeratos and Likas (2012) is another approach that uses a hypothesis test. Dip-means tests the unimodality of each cluster using Hartigan's dip statistic Hartigan and Hartigan (1985) and splits the cluster into two until the distribution within each cluster becomes unimodal.

In contrast to cluster-splitting methods such as X-means, methods that remove clusters that no longer represent the data distribution well also exist. The method of Figueiredo and Jain Figueiredo and Jain (2002), named MML-EM, determines the value of $K$
by fitting a GMM to the data distribution with MML as the objective function. In their method, which uses the EM algorithm, learning starts from a sufficiently large number of clusters, which are gradually annihilated during the learning process. Clusters that are not supported by most samples during the learning process, that is, clusters whose prior probability is close to 0, are removed. The objective function is minimized by the EM algorithm, and cluster annihilation and optimization are performed until no further improvement in the MML occurs. They showed that their method has lower initial value dependence than the standard EM algorithm in experiments with artificial and real data.
The above greedy methods use either $k$-means or the EM algorithm as the learning method. However, $k$-means is known to be strongly dependent on the initial values and to converge easily to a local solution. One of the drawbacks of the EM algorithm is slow convergence. Although MML-EM can reduce the computational time compared to the standard EM, it remains computationally inefficient compared with methods such as X-means.
Examples of methods with little dependence on the initial value that still offer fast convergence are Kohonen's self-organizing map (SOM) Kohonen (1982) and neural gas (NG) Martinetz et al. (1993), which are classical vector quantization methods. Each of these two methods has its advantages and disadvantages. NG is less likely than SOM to converge to a local solution, but requires more computational time for learning.
Our objective is to construct a fast algorithm with low initial value dependence for selecting a suitable number of clusters. As with MML-EM, the algorithm starts with a sufficiently large $K$ and searches for the suitable number of clusters while decreasing $K$. Although SOM and NG are both popular for clustering tasks, the computational cost of NG learning, compared with that of SOM, is unacceptable for our purpose. Therefore, we selected SOM as the learning method and propose a greedy method for automatically selecting $K$ based on the SOM learning rule. We named our method the shrinking maximum likelihood self-organizing map (SMLSOM).
Because a learning algorithm based on SOM is used, the method must be extended to overcome two obstacles inherent to SOM. First, in Kohonen's SOM, clusters are simply constructed as sample averages, but each cluster needs to be constructed as a probabilistic model for model-based clustering. Second, in Kohonen's SOM, each cluster is tied to a node of a two-dimensional lattice map and the structure of the map is fixed; however, the map structure must be dynamically updatable to vary the number of clusters. Our extended SOM is based on the maximum likelihood (ML) method to overcome the first obstacle, after which the map structure is updated to overcome the second. This is made possible by using a graph structure as the map in combination with two procedures: weakly connected link cutting and unnecessary node deletion.
The remainder of this paper is organized as follows. In Section 2, we present work related to our proposed method. In Section 3, we describe our probability distribution model framework and the MDL criterion discussed in this paper. In Section 4, we discuss the SMLSOM algorithm, which comprises the SOM based on the ML method and the updated map structure. Section 5 reports our experimental results obtained with artificial and real data. Section 6 concludes the paper.
2 Background
2.1 Vector quantization method
2.1.1 Self-organizing map
The self-organizing map (SOM) Kohonen (1982); Kohonen (2001) is a learning model based on the structure of the human visual cortex and offers a method for projecting high-dimensional data onto a low-dimensional lattice space. The lattice space of the SOM represents the topological structure of the input space discretized with $M$ reference vectors, and is referred to as a map. A two-dimensional lattice, such as a square or hexagonal lattice, is usually used for the structure of the map. SOM is widely used as a method for vector quantization Gray (1984), clustering Vesanto and Alhoniemi (2000), and data visualization.
The learning algorithm of SOM is divided into two stages. First, the Euclidean distance between an input sample $x(t)$ and the reference vector $m_i(t)$ of each node $i$, which is associated with the input space, is calculated, and the winner node $c$ with the smallest distance is determined. Second, the reference vector of each node is updated such that it closely approximates the input.
Find the winner:

$$c = \operatorname*{arg\,min}_{i} \| x(t) - m_i(t) \| \qquad (1)$$

Update nodes at iteration $t$:

$$m_i(t+1) = m_i(t) + \alpha(t)\, h_{ci}(t)\, [\, x(t) - m_i(t)\, ] \qquad (2)$$
where $\alpha(t)$ is the learning rate ($0 < \alpha(t) < 1$) that controls the degree of learning and $h_{ci}(t)$ is the neighborhood function that adjusts the degree of learning according to the distance on the map between node $c$ and node $i$. These are monotonically decreasing scalar functions with respect to the number of learning iterations $t$. Note that the Gaussian kernel function is usually used for the neighborhood function. With this neighborhood function, the winner node is updated by the largest amount, and the other nodes are updated by amounts that increase the nearer they are to the winner node:
$$h_{ci}(t) = \exp\left( -\frac{\| r_c - r_i \|^2}{2\sigma(t)^2} \right) \qquad (3)$$

where $\| r_c - r_i \|$ is the Euclidean distance on the map between node $c$ and node $i$, and $\sigma(t)$ is a monotonically decreasing scalar function with respect to $t$ that controls the degree of "nearness."
Unlike $k$-means, the learning process of SOM entails "soft-to-hard" learning. In the $k$-means algorithm, the input only updates its nearest node. Therefore, the input and the node are in one-to-one correspondence, and the node is updated based on this correspondence relation. In contrast, in the SOM, the input and the nodes are in a one-to-many relationship, as defined by the neighborhood function, in the earlier stage of learning. They eventually converge to attain one-to-one correspondence, increasingly resembling the $k$-means algorithm as learning progresses. The "soft-to-hard" learning used by SOM means it is expected to be less likely than $k$-means to converge to local minima.
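The soft-to-hard learning rule of Eqs. (1)–(3) can be sketched in a few lines. In the following illustrative implementation, the function name, parameter names, and the linear decay schedules for $\alpha(t)$ and $\sigma(t)$ are our own choices, not prescriptions from any particular reference:

```python
import numpy as np

def som_train(X, grid, n_iter=200, alpha0=0.5, sigma0=1.0, seed=0):
    """Minimal sketch of Kohonen SOM learning, Eqs. (1)-(3).

    X    : (n, p) input samples
    grid : (M, q) fixed node coordinates on the map
    Returns the learned (M, p) reference vectors.
    """
    rng = np.random.default_rng(seed)
    M = len(grid)
    # initialize reference vectors from randomly chosen samples
    m = X[rng.choice(len(X), size=M, replace=False)].astype(float)
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        alpha = alpha0 * (1.0 - t / n_iter)         # decreasing learning rate
        sigma = sigma0 * (1.0 - t / n_iter) + 1e-3  # shrinking neighborhood
        c = int(np.argmin(np.linalg.norm(x - m, axis=1)))   # Eq. (1)
        d2 = np.sum((grid - grid[c]) ** 2, axis=1)          # map distances
        h = np.exp(-d2 / (2.0 * sigma ** 2))                # Eq. (3)
        m += alpha * h[:, None] * (x - m)                   # Eq. (2)
    return m
```

As $\sigma(t)$ shrinks, $h$ concentrates on the winner and the update approaches the hard, winner-only rule of $k$-means.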
2.1.2 Neural gas
Neural gas (NG) Martinetz et al. (1993) was proposed as a development of the SOM model. SOM requires the map structure to be prespecified such that it is suitable to the input space. However, in general, such a structure cannot be known in advance. NG enables a representation better matched to the input distribution to be obtained by defining neighborhood relations between nodes from the input itself.
In the learning stage of NG, nodes are ranked in increasing order of the Euclidean distance from the input to each node (known as neighborhood ranking Martinetz et al. (1993)). Let $k_i(x, t)$ be the rank of node $i$ at learning step $t$, where $k_i \in \{0, 1, \ldots, M-1\}$. Then, the reference vector is updated as follows:

$$m_i(t+1) = m_i(t) + \alpha(t)\, h_{\lambda}(k_i(x, t))\, [\, x(t) - m_i(t)\, ] \qquad (4)$$

where $h_{\lambda}(k) = \exp(-k / \lambda(t))$ and $\lambda(t)$ is a monotonically decreasing scalar function of $t$.
Unlike the SOM, which specifies the neighborhood relationship between nodes in advance, NG dynamically determines the neighborhood between nodes in the input space. Then, similar to SOM, the range of the neighborhood is narrowed and soft-to-hard learning is performed. This approach obtains improved results compared to SOM.
NG requires the distances $\| x - m_i \|$ to be sorted to calculate the ranking $k_i$; thus, each learning step takes $O(M \log M)$ time. Therefore, NG is computationally expensive compared with SOM.
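The ranking-based update of Eq. (4) can be sketched as follows; the function name and the linear decay schedules are our illustrative choices:

```python
import numpy as np

def ng_train(X, M=4, n_iter=200, alpha0=0.5, lam0=2.0, seed=0):
    """Minimal sketch of Neural Gas learning, Eq. (4)."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=M, replace=False)].astype(float)
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        alpha = alpha0 * (1.0 - t / n_iter)
        lam = lam0 * (1.0 - t / n_iter) + 1e-3
        # neighborhood ranking: sorting M distances costs O(M log M) per step
        order = np.argsort(np.linalg.norm(x - m, axis=1))
        rank = np.empty(M, dtype=float)
        rank[order] = np.arange(M)                 # k_i(x, t) for each node
        h = np.exp(-rank / lam)                    # h_lambda(k_i)
        m += alpha * h[:, None] * (x - m)          # Eq. (4)
    return m
```

The `argsort` call is exactly the per-step sorting cost noted above, which SOM avoids because its neighborhood is fixed by the map.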
2.2 Model-based clustering
Model-based clustering is performed based on the finite mixture model with $K$ components as follows McLachlan and Rathnayake (2015); Fraley and Raftery (2002); Bouveyron and Brunet-Saumard (2014):

$$f(x; \theta) = \sum_{k=1}^{K} \pi_k f_k(x; \theta_k) \qquad (5)$$

where $\pi_1, \ldots, \pi_K$ are the mixing probabilities that satisfy $\pi_k \ge 0$ and $\sum_{k=1}^{K} \pi_k = 1$, $f_k$ is the probability density function of the $k$th component with parameter $\theta_k$, and $\theta = \{\pi_1, \ldots, \pi_{K-1}, \theta_1, \ldots, \theta_K\}$ is the set of parameters necessary to specify the mixture. Given a set of independent and identically distributed samples $X = \{x_1, \ldots, x_n\}$, where $x_i$ is a $p$-dimensional real vector, the log-likelihood is given by

$$\log L(\theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k) \qquad (6)$$
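For concreteness, the mixture density of Eq. (5) and the log-likelihood of Eq. (6) can be evaluated directly. The sketch below (function and argument names are ours) does this for a Gaussian mixture:

```python
import numpy as np

def gmm_loglik(X, pis, mus, Sigmas):
    """Log-likelihood of a finite Gaussian mixture, Eqs. (5)-(6).

    X : (n, p) data; pis, mus, Sigmas : per-component parameters. Sketch only.
    """
    n, p = X.shape
    total = 0.0
    for x in X:
        density = 0.0
        for pi_k, mu, S in zip(pis, mus, Sigmas):
            diff = x - mu
            quad = diff @ np.linalg.solve(S, diff)   # (x-mu)^T S^{-1} (x-mu)
            norm = (2.0 * np.pi) ** (-p / 2.0) / np.sqrt(np.linalg.det(S))
            density += pi_k * norm * np.exp(-0.5 * quad)   # Eq. (5)
        total += np.log(density)                            # Eq. (6)
    return float(total)
```

A numerically robust implementation would work with log-densities and `logsumexp`; the direct form above mirrors the equations as written.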
2.3 Model selection criterion

The Akaike information criterion (AIC) Akaike (1974) is given by

$$\mathrm{AIC} = -2 \log L(\hat{\theta}) + 2\nu \qquad (10)$$

where $\nu$ is the degree of freedom of the model parameter $\theta$ and $\hat{\theta}$ is the maximum likelihood estimator of $\theta$.

The Bayesian information criterion (BIC) Schwarz and others (1978) is the criterion derived from the Bayesian framework, given by

$$\mathrm{BIC} = -2 \log L(\hat{\theta}) + \nu \log n \qquad (11)$$

The integrated classification likelihood (ICL) Biernacki et al. (2000) is a modified version of BIC for classification. BIC is known to tend to recover $K$ successfully for density estimation but to select too large a $K$ for classification; ICL was derived against this background. ICL is approximated as follows McLachlan and Rathnayake (2014):

$$\mathrm{ICL} = \mathrm{BIC} + 2\, \mathrm{EN}(\hat{\tau}) \qquad (12)$$

where $\tau$ is the membership probability matrix, $\tau_{ik}$ is the probability that the $i$th sample belongs to the $k$th component, and $\mathrm{EN}(\tau)$ is the entropy of $\tau$, described by

$$\mathrm{EN}(\tau) = -\sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik} \log \tau_{ik} \qquad (13)$$
Details about AIC, BIC, ICL, and other information criteria have been published McLachlan and Rathnayake (2014). MDL is described in Section 3.2.
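Given a fitted mixture's maximized log-likelihood, these criteria reduce to a few arithmetic operations. The sketch below (names are ours) computes AIC (10), BIC (11), and the entropy-penalized ICL of Eqs. (12)–(13):

```python
import numpy as np

def aic(loglik, nu):
    """Eq. (10): -2 log L + 2 * (number of free parameters)."""
    return -2.0 * loglik + 2.0 * nu

def bic(loglik, nu, n):
    """Eq. (11): -2 log L + nu * log n."""
    return -2.0 * loglik + nu * np.log(n)

def icl(loglik, nu, n, tau):
    """Eqs. (12)-(13): BIC plus twice the entropy of the membership matrix."""
    tau = np.clip(tau, 1e-12, 1.0)        # avoid log(0) for hard assignments
    entropy = -np.sum(tau * np.log(tau))  # EN(tau)
    return bic(loglik, nu, n) + 2.0 * entropy
```

For a hard (0/1) membership matrix the entropy term vanishes and ICL coincides with BIC; the penalty grows as component memberships become uncertain, which is why ICL discourages overlapping components in classification.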
3 Framework
3.1 Classification likelihood
Let $x$ be $p$-dimensional data and $X = \{x_1, \ldots, x_n\}$ be a dataset containing $n$ samples. Let $\{f_k\}_{k=1}^{M}$ be a set of probability distribution models and $\theta_k$ the model parameter of the $k$th model. Assume that each sample follows one of the models independently. Let $z_i$ be the model number to which $x_i$ belongs. Let $\Theta = \{\theta_1, \ldots, \theta_M\}$ be the collection of model parameters. Then, the likelihood we considered in this study is described by

$$L(\Theta, z) = \prod_{i=1}^{n} f_{z_i}(x_i; \theta_{z_i}) \qquad (14)$$

This is sometimes known as classification likelihood in a classification context or as the complete-data likelihood within the EM framework Fraley and Raftery (2002); McLachlan and Rathnayake (2015). In this study, we estimate not only $\Theta$ but also $z_i$, which is the classification of sample $x_i$. Note that an estimated value of $z_i$, represented by $\hat{z}_i$, is a discrete value, namely $\hat{z}_i \in \{1, \ldots, M\}$.
3.2 Minimum description length criterion
The minimum description length (MDL) Rissanen (1978); Hansen and Yu (2001) is a model selection criterion according to which the best model is the one that can encode the given data in the most concise manner.
To clarify our situation, we describe two kinds of MDL representations. The first is based on the situation in which samples are independent and identically distributed. The second pertains to the situation in which each sample follows one of several probability distribution models. The former situation is exemplified by the finite mixture model described in Section 2.2. The latter is our situation.
3.2.1 Standard MDL
Assume that each sample of a dataset $X = \{x_1, \ldots, x_n\}$ follows the probability distribution $f(\cdot; \theta_k)$ independently. Consider encoding $X$ based on $f$ and sending it. Let $\mathcal{M}$, the set of model structural parameters $k$, be the model class. Let both the sender and receiver know $n$ and the common model class $\mathcal{M}$. Here, we suppose $\mathcal{M}$ is a finite set, namely $|\mathcal{M}| = \bar{M}$. The sender chooses a model $k$ from the model class, then encodes the data using the specified model. To enable the receiver to decode the received code to obtain $X$, the sender needs to send the following three types of information: the specified model $k$, the model parameter $\theta_k$, and $X$ encoded based on $f(\cdot; \theta_k)$. The codelength to specify $k$ is $\log \bar{M}$ and that of $X$ based on $f$ is $-\log f(X; \theta_k)$ Thomas M. Cover (2006). Therefore, the codelength of $X$ is given by

$$\ell(X) = -\log f(X; \theta_k) + \ell(\theta_k) + \log \bar{M} \qquad (15)$$

where $\ell(\theta_k)$ is the codelength of the model parameter.

According to the MDL criterion, we choose $k$, which minimizes $\ell(X)$. Minimizing the first term on the right-hand side of Eq. (15) provides the maximum likelihood (ML) estimator $\hat{\theta}_k$. However, encoding $\hat{\theta}_k$ requires an infinite codelength because the ML estimator has a real value. To encode $\hat{\theta}_k$ with a finite codelength, consider the following. Divide the parameter space of $\theta_k$ into a countable number of hypercubes. Then, encode the center of the hypercube into which $\hat{\theta}_k$ falls, instead of $\hat{\theta}_k$ itself; thus, the codelength is finite.

By calculating the asymptotic expansion of $-\log f$ and optimizing the edge length of the hypercube, we obtain the optimized codelength of $X$:

$$\ell(X) = -\log f(X; \hat{\theta}_k) + \frac{\nu_k}{2} \log n + \log \bar{M} \qquad (16)$$

The third term of the above equation is unnecessary to compare models. Therefore, by eliminating it, we obtain the well-known MDL criterion, given by

$$\mathrm{MDL} = -\log f(X; \hat{\theta}_k) + \frac{\nu_k}{2} \log n \qquad (17)$$
When Eq. (17) is multiplied by 2, it coincides with the BIC of Eq. (11). Consequently, BIC and MDL are equal under the finite mixture model.
3.2.2 MDL for classification
In this clause, assume that each sample follows one of $M$ kinds of probability distributions. Unlike the previous clause, we cannot encode all the samples without specifying the model with which each sample is encoded. Therefore, we need to encode samples and models on a one-to-one basis to ensure that they correspond. Because the probability distribution of $z_i$ cannot be known in advance, assuming the probability of each value of $z_i$ is $1/M$, the codelength of $z = (z_1, \ldots, z_n)$ is $n \log M$. Then, all samples are classified into $M$ groups using the information of $z$ and each sample group is encoded based on the corresponding model. Then, the codelength of $(X, z)$ is given by

$$\ell(X, z) = \sum_{k=1}^{M} \left[ -\log f_k(X_k; \theta_k) + \ell(\theta_k) \right] + n \log M \qquad (18)$$

where the codelength to specify $M$ was omitted. Minimizing Eq. (18) similarly to the previous clause, the MDL criterion for classification is given by

$$\mathrm{MDL} = \sum_{k=1}^{M} \left[ -\log f_k(X_k; \hat{\theta}_k) + \frac{\nu_k}{2} \log n_k \right] + n \log M \qquad (19)$$

where $X_k = \{x_i : \hat{z}_i = k\}$ and $n_k = |X_k|$, which specify the partition of the samples $X$. Minimizing Eq. (19) over every $z$ enables the optimal classification to be obtained. Consequently, the MDL of Eq. (19) also includes the assessment of the classification of samples, namely, the clustering result, unlike the well-known MDL derived from the situation of the previous clause.
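Under the reconstruction above, Eq. (19) is straightforward to compute once each cluster has been fitted. The sketch below (function and argument names are ours) takes the per-cluster maximized log-likelihoods and parameter counts as given:

```python
import numpy as np

def classification_mdl(logliks, nus, labels, M):
    """MDL for classification, Eq. (19), as reconstructed here: per-cluster
    fit and parameter cost, plus n*log M nats to encode the labels.

    logliks : list of maximized log-likelihoods, one per cluster
    nus     : list of free-parameter counts, one per cluster
    labels  : length-n integer array of cluster assignments (0..M-1)
    """
    n = len(labels)
    counts = np.bincount(labels, minlength=M)   # n_k for each cluster
    total = n * np.log(M)                       # codelength of the labels z
    for k in range(M):
        if counts[k] > 0:
            total += -logliks[k] + 0.5 * nus[k] * np.log(counts[k])
    return float(total)
```

Note the label term $n \log M$ shrinks as nodes are deleted, which is what lets the criterion reward smaller maps when the fit barely changes.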
4 Algorithm
SMLSOM uses a map that comprises nodes and links between the nodes. A node represents the parameters of a probability distribution model, and a link represents that the two linked nodes are neighbors. This neighborhood relationship is important for the SOM learning rule.

The algorithm has two components. The first is a "soft-to-hard" learning step that takes probability distribution models and the map as input and learns the model parameters based on the SOM learning rule, where the winner node is determined by the maximum likelihood method. The second is a step in which the map structure adapts to the given data by determining, using the MDL described in Section 3, which models are no longer neighbors and which models are unnecessary.
4.1 SOM based on the ML method
Herein, we describe an extension of Kohonen's SOM that assigns input samples to clusters by means of the maximum likelihood method. In this extension, node $k$ represents one of the models $f_k(\cdot; \theta_k)$, and each sample of the dataset belongs to only one of these models. Therefore, given a sample $x$, the likelihood of each of the $M$ models can be calculated. Hence, the winner node is determined as the node with the maximum likelihood for a given sample among the nodes as follows:

$$c = \operatorname*{arg\,max}_{k} \log f_k(x; \theta_k) \qquad (20)$$

after which the winner node and its neighbors are adapted for $x$ based on the SOM learning rule.
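A minimal sketch of the winner search of Eq. (20); the log-density functions below are illustrative stand-ins for $\log f_k(\cdot; \theta_k)$:

```python
import math

def winner(x, log_models):
    """Eq. (20): the winner is the node whose model gives sample x the
    highest log-likelihood. `log_models` is a list of log-density functions."""
    best, best_ll = 0, -math.inf
    for k, logf in enumerate(log_models):
        ll = logf(x)
        if ll > best_ll:
            best, best_ll = k, ll
    return best
```

With equal-covariance Gaussian nodes this reduces to a nearest-mean search, which is the link to Kohonen's distance-based rule (1).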
The adaptation is performed as follows. In this version of SOM, we approximate the $r$th order moments of $f_k$,

$$\mu_k^{(r)} = E_{f_k}\left[ y^{(r)}(x) \right] \qquad (21)$$

using a stochastic approximation method Robbins and Monro (1951). Let $y^{(r)}(x)$ be a statistic of a sample $x$ with $r$th order moment, where $r$ is a nonnegative integer. Under the mean squared error criterion between $y^{(r)}(x)$ and (21), the update rule is given by

$$\hat{\mu}_k^{(r)}(t+1) = \hat{\mu}_k^{(r)}(t) + \alpha(t)\, h_{ck}(t)\, \left[ y^{(r)}(x(t)) - \hat{\mu}_k^{(r)}(t) \right] \qquad (22)$$

where $\alpha(t)$ is the learning rate at time step $t$, and $\alpha(t)$ and $h_{ck}(t)$ decrease monotonically.
This moment approximation rule provides a simple parameter update rule for probability distributions whose parameters can be estimated using the method of moments. For instance, consider the $p$-dimensional normal distribution

$$f_k(x; \mu_k, \Sigma_k) = (2\pi)^{-p/2} |\Sigma_k|^{-1/2} \exp\left( -\frac{1}{2} (x - \mu_k)^{\top} \Sigma_k^{-1} (x - \mu_k) \right) \qquad (23)$$

where $\mu_k$ is a mean vector and $\Sigma_k$ is a covariance matrix. Based on the approximation rule (22) and using the method of moments, when a sample $x(t)$ is assigned, the parameters of the $k$th node are updated as follows:

$$\mu_k(t+1) = \mu_k(t) + \alpha(t)\, h_{ck}(t)\, \left[ x(t) - \mu_k(t) \right] \qquad (24)$$

$$\Sigma_k(t+1) = \Sigma_k(t) + \alpha(t)\, h_{ck}(t)\, \left[ (x(t) - \mu_k(t))(x(t) - \mu_k(t))^{\top} - \Sigma_k(t) \right] \qquad (25)$$

where $\mu_k$ is a $p$-dimensional real vector and $\Sigma_k$ is a real symmetric matrix of size $p \times p$.
Another instance of the parameter update that Eq. (22) can provide, the case of the multinomial distribution used in Section 5.3, is described in Appendix B.
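For the Gaussian case, the updates (24)–(25) are a few lines of code. A sketch (names are ours), with the covariance update using the pre-update mean, as in the reconstruction above:

```python
import numpy as np

def update_gaussian_node(mu, Sigma, x, alpha, h):
    """One moment-based update of a Gaussian node, Eqs. (24)-(25) (sketch).

    mu, Sigma : current node parameters; x : assigned sample
    alpha, h  : learning rate and neighborhood weight for this node
    """
    gain = alpha * h
    diff = x - mu                                    # uses the pre-update mean
    mu_new = mu + gain * diff                        # Eq. (24)
    Sigma_new = Sigma + gain * (np.outer(diff, diff) - Sigma)   # Eq. (25)
    return mu_new, Sigma_new
```

Because each update is a convex combination of a symmetric matrix and a rank-one outer product, the covariance estimate stays symmetric throughout learning.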
We named this extension of SOM the maximum likelihood SOM (MLSOM) to distinguish it from Kohonen's SOM. Note that MLSOM is a generalization of Kohonen's SOM. Consider a $p$-dimensional normal distribution whose covariance matrix is the identity matrix; then, the log-likelihood is proportional to $-\| x - \mu_k \|^2$. Hence, the rule to find the winner node (20) becomes the process of minimizing the Euclidean distance between $x$ and $\mu_k$. In this case, it is no longer necessary to update $\Sigma_k$, and only the parameters $\mu_k$ need to be updated by (24). Therefore, MLSOM coincides with Kohonen's SOM described in Section 2.

4.2 Map structure update
The method we describe in this section to update the map structure comprises two components: disconnecting weak links and deleting unnecessary nodes.
We introduce some notation to explain this method. Let $E$ be a set of undirected edges that are 2-element subsets of a set of nodes $V$, and let a map be represented by a graph $G = (V, E)$. The elements of $E$ represent links denoting the neighborhood relationship between nodes.
4.2.1 Link cutting
Consider deciding whether an edge $\{k, l\} \in E$ should be removed. Here, we introduce a measure of the weakness of a node connection using the Kullback–Leibler (KL) divergence Kullback and Leibler (1951). Let $D_{KL}(f_k \| f_l)$ be the KL divergence for two probability distribution models $f_k, f_l$, defined as follows:

$$D_{KL}(f_k \| f_l) = \int f_k(x; \theta_k) \log \frac{f_k(x; \theta_k)}{f_l(x; \theta_l)} \, dx \qquad (26)$$

Let $w_{kl}$ be the weakness of the connection between nodes $k$ and $l$, defined by

$$w_{kl} = \max\left\{ \hat{D}_{KL}(f_k \| f_l),\; \hat{D}_{KL}(f_l \| f_k) \right\} \qquad (27)$$

where $\hat{D}_{KL}(f_k \| f_l)$ is an estimator of the Kullback–Leibler divergence,

$$\hat{D}_{KL}(f_k \| f_l) = \frac{1}{n_k} \sum_{x_i \in X_k} \left[ \log f_k(x_i; \hat{\theta}_k) - \log f_l(x_i; \hat{\theta}_l) \right] \qquad (28)$$

Eq. (28) represents the deterioration in likelihood per sample when all the samples belonging to node $k$ move to node $l$. $\hat{D}_{KL}(f_l \| f_k)$ is defined similarly.
The threshold for $w_{kl}$ is calculated from the average likelihood in each node,

$$\bar{\ell} = \min_{k} \frac{1}{n_k} \sum_{x_i \in X_k} \log f_k(x_i; \hat{\theta}_k) \qquad (29)$$

then the following rule is used to decide whether to remove the edge $\{k, l\}$:

$$\text{remove } \{k, l\} \iff w_{kl} > \beta\, |\bar{\ell}| \qquad (30)$$

where the parameter $\beta$ controls the hardness of removing edges. The threshold $|\bar{\ell}|$, which represents the worst average likelihood among nodes, makes it difficult to cut edges. If many isolated nodes without edges to others exist, the SOM learning rule reduces to the simple competitive learning rule, in which case the learning process may converge to a poor local optimum. Therefore, it is preferable to retain edges as much as possible to avoid a poor local optimum.
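The decision rule can be sketched as follows. Note that the exact form of the weakness measure and threshold here reflects our reading of Eqs. (27)–(30), and the function names are ours:

```python
import numpy as np

def kl_estimate(Xk, logf_k, logf_l):
    """Empirical estimator of Eq. (28): mean drop in log-likelihood when the
    samples of node k are encoded by node l's model instead of node k's."""
    return float(np.mean(logf_k(Xk) - logf_l(Xk)))

def cut_weak_links(edges, weakness, threshold, beta=1.0):
    """Eq. (30), sketched: keep edge {k, l} only while its weakness w_kl
    stays within beta * threshold; otherwise the edge is removed."""
    return [e for e in edges if weakness[e] <= beta * threshold]
```

Larger `beta` makes cutting harder, which matches the stated preference for retaining edges to avoid degenerating into plain competitive learning.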
4.2.2 Node deletion
The node deletion procedure decides whether to remove a node. An unnecessary node is determined based on the MDL described in Section 3 and removed from the graph. For each node $k \in V$, this is determined as follows.

1. Evaluate the current map using the MDL (19).
2. Remove node $k$ and classify each sample of $X_k$ to a node of $V \setminus \{k\}$ by the maximum likelihood method.
3. Using the above-classified samples, estimate the parameters of the nodes by the method of moments.
4. Let the map of nodes with the new parameters be a candidate map and evaluate it using the MDL.
5. Compare the MDL of the candidate map with that of the current map and select the one that is more appropriate, i.e., has the smaller MDL.
If a node was deleted, remove the edges incident to it. In addition, insert new edges between all the nodes that were adjacent to the deleted node, to prevent isolated nodes from being generated for the above-mentioned reason.
4.3 Complete algorithm
The algorithm of SMLSOM is shown in Algorithm 1. The initial map structure is a rectangular or hexagonal lattice graph. The map can be initialized in two ways. The first is to initialize all parameters randomly. In the second, only the first moments, namely the mean vectors, are initialized based on the principal components of the dataset (see Appendix A), and the higher moments are initialized uniformly or randomly. We recommend the second approach because this form of initialization is likely to produce similar maps even if the order of the inputs differs.
MLSOM constructs clusters by the maximum likelihood method with soft-to-hard learning, which is the SOM learning rule using the neighborhood relationship between nodes defined by the graph $G = (V, E)$.
The link cutting procedure removes edges between dissimilar nodes. As a result, in the subsequent SOM update, the nodes connected by the remaining links are efficiently updated toward similar ones. In the node deletion procedure, when two similar nodes exist, one of the nodes is deleted such that it is merged with the other. Then, new edges are added between the nodes that were adjacent to the deleted node. This procedure prevents the neighborhood relationship between nodes from changing excessively when a node is removed from the map. Thus, when this mended map is input into MLSOM, the nodes that were adjacent to the deleted node are reorganized. Hence, the learning process of SMLSOM can proceed to reduce the number of clusters.
5 Experiment
In this section, we clarify the effectiveness of the proposed method compared with other methods. Our experiments were organized as follows. First, we demonstrated that the learning process of SMLSOM can be used to determine the suitable number of clusters, and we compared its characteristics with those of other methods using small real datasets. Second, we evaluated the estimation of $K$, the accuracy of the clustering, and the computational time of SMLSOM in comparison with selected other methods. We used X-means Ishioka (2000), MML-EM Figueiredo and Jain (2002), and Mclust Fraley et al. (2007); Fraley and Raftery (2012). The artificial dataset used in the simulation was generated by the MixSim package (MixSim: Simulating Data to Study Performance of Clustering Algorithms, available at https://cran.r-project.org/web/packages/MixSim/index.html) Melnykov et al. (2012) in R, which can vary the overlap between clusters. The experimental results showed that SMLSOM delivers high performance within a short computational time. Finally, to exemplify the applicability of the proposed method, we applied SMLSOM to a document clustering task. This task requires methods with the ability to extract meaningful information hidden in the given data for applications such as topic modeling Blei (2012). If a method selects $K$ correctly but an uninterpretable result is obtained, it is not useful for applications. Therefore, we analyzed the clusters produced by SMLSOM and discussed whether the clusters could be interpreted and whether they contained the appropriate data.
5.1 Demonstration of the algorithm using real data
In this section, we present a demonstration of the proposed method using a real dataset and compare it with existing methods regarding the selection of a model for clustering.
We consider the Old Faithful dataset, to which we fit bivariate Gaussians with full covariance matrices. For comparison, we ran SMLSOM and Mclust using this dataset as input. Then, we compared the results we obtained with those reported previously by other workers. For SMLSOM, we started with a hexagonal lattice using PCA initialization (the covariance matrices were initialized to the identity matrix) and the settings described in Section 4. For Mclust, we evaluated each candidate number of components by BIC/ICL. The default initialization method of the EM algorithm in Mclust is based on model-based hierarchical agglomerative clustering (see Fraley et al. (2007); Fraley and Raftery (2002)). Fig. 1 shows the intermediate estimates and the final estimate produced by SMLSOM. The figure shows the "shrinking" process of the map as the map structure is updated by deleting nodes and cutting links. With SMLSOM, two clusters were selected 99 times; hence, two clusters were strongly supported. With Mclust, similar to the result of SMLSOM, two components were selected with both BIC and ICL. Therefore, the result of Mclust also supported two clusters for the dataset. On the other hand, the results obtained with the Bayesian inference method Richardson and Green (1997) contrast with the above-mentioned results supporting two clusters. That is, Dellaportas and Papageorgiou (2006) reported that the posterior probability of three components is the highest (0.5845) and that of two components the second highest (0.3035), and another study Komárek (2009) found that three and two components have almost the same posterior probabilities. For the clustering task, however, we considered two clusters to be sufficient for the Old Faithful dataset. The Bayesian inference method Richardson and Green (1997), whose main purpose is density estimation, does not always conform to our situation, namely model selection for clustering.
5.2 Simulation with artificial data
We used some existing methods for the experiment. Mclust is an implementation of the EM algorithm for Gaussian mixtures in FORTRAN. The FORTRAN code of X-means (available at http://www.rd.dnc.ac.jp/~tunenori/xmeans_e.html, accessed 2017/3) Ishioka (2000) and the MATLAB code of MML-EM (available at http://www.lx.it.pt/~mtf/, accessed 2017/3) Figueiredo and Jain (2002), provided by their respective authors, were used in our experiment and are publicly available. In general, MATLAB is inferior to C and FORTRAN in terms of the computational time of loop procedures; thus, we converted the MATLAB code into C to enable a fair comparison of computational time with the FORTRAN and C implementations of the other methods.
We generated an artificial dataset using MixSim, which considers the following:

$$\omega_{j|i} = P\left[ \pi_i f(x; \mu_i, \Sigma_i) < \pi_j f(x; \mu_j, \Sigma_j) \mid x \sim N(\mu_i, \Sigma_i) \right] \qquad (31)$$

where $\omega_{j|i}$ is the misclassification probability that a sample generated from the $i$th component is classified mistakenly to the $j$th component, and $\omega_{i|j}$ is defined similarly. The overlap between two components is defined by

$$\omega_{ij} = \omega_{j|i} + \omega_{i|j} \qquad (32)$$

We can specify $\bar{\omega}$, the average of the $\omega_{ij}$, and generate datasets using MixSim. The procedure MixSim uses to generate data corresponding to the specified overlap is as follows Melnykov et al. (2012):

1. Mean vector $\mu_k$ is sampled from a $p$-dimensional uniform distribution, and covariance matrix $\Sigma_k$ is obtained from the standard Wishart distribution. Note that the user can specify the structure of the covariance matrices as being either spherical or non-spherical, and heterogeneous or homogeneous. If the spherical structure is specified, the non-diagonal elements of $\Sigma_k$ are set to 0, and the diagonal elements are taken from the standard uniform distribution. If the homogeneous structure is specified, set $\Sigma_1 = \cdots = \Sigma_K = \Sigma$, where $\Sigma$ is generated by either of the aforementioned two methods.
2. The covariance matrices are multiplied by a positive constant $c$, after which the value of $c$ that minimizes the difference between the user-specified $\bar{\omega}$ and the $\bar{\omega}$ obtained with the current multiplier is determined.
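To make the overlap of Eqs. (31)–(32) concrete, the misclassification probability $\omega_{j|i}$ can be approximated by simulation for two univariate Gaussian components. MixSim itself computes this quantity exactly rather than by sampling, so the Monte Carlo sketch below (names are ours) is for illustration only:

```python
import numpy as np

def overlap_mc(pi_i, mu_i, s_i, pi_j, mu_j, s_j, n=100000, seed=0):
    """Monte Carlo estimate of omega_{j|i} in Eq. (31) for two univariate
    Gaussian components with mixing weights pi_i, pi_j (sketch only)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu_i, s_i, n)                 # x ~ N(mu_i, s_i^2)
    def weighted_logdens(pi, mu, s):
        # log of pi * phi(x; mu, s), dropping the common -0.5*log(2*pi) term
        return np.log(pi) - np.log(s) - (x - mu) ** 2 / (2.0 * s ** 2)
    # misclassified to j when pi_j * phi_j(x) exceeds pi_i * phi_i(x)
    return float(np.mean(weighted_logdens(pi_j, mu_j, s_j)
                         > weighted_logdens(pi_i, mu_i, s_i)))
```

For equal weights and unit variances with means 0 and 2, the decision boundary is the midpoint $x = 1$, so $\omega_{j|i} = 1 - \Phi(1) \approx 0.159$; well-separated means drive the overlap toward 0.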
We generated samples from the GMM with spherical and heterogeneous covariance matrices using MixSim, with equal prior probability for each component. We set $\bar{\omega}$ to 10 different values and created 100 sets of samples under the above-mentioned condition for each $\bar{\omega}$. Then, each method was run 10 times on each sample set. Thus, 1,000 values of $\hat{K}$ were produced by each method for each $\bar{\omega}$.
The settings of each algorithm were as follows. The initial parameters were randomly initialized. All four methods were run with full covariance matrices. X-means and MML-EM started with a sufficiently large initial number of clusters. SMLSOM started with a hexagonal lattice; $\beta$ was set to 15, and the other parameters were assigned the default values of the kohonen package (kohonen: Supervised and Unsupervised Self-Organizing Maps, available at https://cran.r-project.org/web/packages/kohonen/index.html) in R. Mclust was applied to each candidate $K$, and its result was evaluated by BIC and ICL.
We evaluated the results of the four methods from three points of view: the accuracy of the estimation of $K$, the accuracy and stability of clustering, and the computational time. We used the adjusted Rand index (ARI) Hubert and Arabie (1985) to evaluate the accuracy of clustering; the calculation of the ARI is described in Appendix C. The stability was evaluated by the standard deviation of the ARIs. Note that the computational time of Mclust was the total over all candidate values of $K$ needed to decide $K$.
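For reference, the ARI can be computed from the contingency table of two labelings using the standard Hubert–Arabie formula; a self-contained sketch (the function name is ours):

```python
import numpy as np

def adjusted_rand_index(a, b):
    """Adjusted Rand index of two labelings (Hubert and Arabie, 1985)."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    # build the contingency table C of the two partitions
    _, ia = np.unique(a, return_inverse=True)
    _, ib = np.unique(b, return_inverse=True)
    C = np.zeros((ia.max() + 1, ib.max() + 1))
    for i, j in zip(ia, ib):
        C[i, j] += 1
    comb2 = lambda x: x * (x - 1) / 2.0       # "choose 2", elementwise
    sum_ij = comb2(C).sum()
    sum_a = comb2(C.sum(axis=1)).sum()
    sum_b = comb2(C.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(n)
    max_index = (sum_a + sum_b) / 2.0
    return float((sum_ij - expected) / (max_index - expected))
```

An ARI of 1 indicates identical partitions up to label permutation, 0 is the chance level, and negative values indicate agreement below chance.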
We ran all four of these methods on a computer running Vine Linux 6.3 64 bit, Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93 GHz, 4.0 GB memory.
Fig. 2 shows the accuracy with which each method estimated . The results reveal that SMLSOM is superior to the other methods with regard to selecting . Fig. 3 shows the accuracy and stability of clustering by each method. The results show that SMLSOM and MML-EM have the highest accuracy among the methods. In addition, these two methods are more stable than the others, which reveals that they are less dependent on the initial parameters. Although the performance of SMLSOM was superior to that of MML-EM when the cluster overlap was low, with a larger overlap SMLSOM was equivalent or slightly inferior to MML-EM. A large value of indicates that two clusters are almost completely mixed and indistinguishable from each other. Therefore, any attempt to partition such clusters into two subgroups incurs a large misclassification rate; merging two such clusters into one, rather than partitioning them, may therefore improve the ARI. Indeed, when the overlap is large, MML-EM selected stably, whereas SMLSOM selected frequently but with high variance.
Fig. 4 shows the computational time required by each method. The results reveal that the EM algorithm (Mclust) is the slowest of the four methods. Compared with EM, MML-EM is approximately times faster, SMLSOM is approximately times faster, and X-means is roughly times faster, making it the fastest of the four methods.
X-means is an extremely fast method; however, it almost always failed to estimate when the cluster overlap was large, and the accuracy and stability of its clustering were therefore low in this experiment. MML-EM avoids the drawbacks of EM, i.e., initial parameter dependence and slow convergence; however, its computational time remains large compared with SMLSOM and X-means. SMLSOM achieved accurate and stable clustering comparable to that of MML-EM, and did so at a lower computational cost than the EM-based methods.
5.3 Document clustering
Table 1: Number of pages per university and class.

              course  faculty  project  student
Cornell           44       34       20      128
Texas             38       46       20      148
Washington        77       31       21      126
Wisconsin         85       42       25      156
Misc.            686      971      418     1083
Table 2: Number of documents of each class in each cluster.

     course  faculty  project  student
A        96       38      328       43
B        30      158       23      286
C         7      570      125      155
D       780       14        9       43
E         8       25        5      589
F         9      319       14      525
Table 3: Most frequent words in each cluster.

No.  A            B            C            D            E            F
1    research     DDDD         DDDD         D            page         DDD
2    project      computer     computer     DD           home         computer
3    DD           DDD          research     DDD          computer     university
4    computer     science      university   DDDD         DDDD         science
5    systems      DD           DDD          :TIME:       DD           :PHONENUM:
6    page         university   systems      D*           university   DDDD
7    information  page         DD           page         science      page
8    DDDD         research     science      computer     homepage     home
9    D            D            D*           information  DDD          department
10   group        home         professor    office       links        research
11   system       department   computing    class        D            phone
12   science      D*           D            homework     information  office
13   home         :PHONENUM:   department   programming  student      DDDDD
14   programming  information  parallel     home         department   D
15   university   systems      engineering  assignments  resume       DD
16   software     interests    :PHONENUM:   hours        stuff        D*
17   data         csDDD        information  fall         research     information
18   projects     software     software     assignment   personal     student

D denotes an arbitrary digit; D* denotes digits of arbitrary length.
:TIME: and :PHONENUM: denote a time and a phone number, respectively.
Table 4: Words that appear frequently only in cluster E.

resume   sites      music   favorite  pictures  sports
india    stuff      china   cool      friends   school
chinese  bookmarks
Table 5: Ratio of documents containing each word, per cluster.

             A     B     C     D     E     F
music     0.02  0.09  0.02  0.00  0.11  0.04
sports    0.01  0.04  0.01  0.00  0.09  0.02
cool      0.03  0.09  0.01  0.01  0.12  0.02
favorite  0.02  0.10  0.02  0.01  0.14  0.03
We used the WebKB dataset (http://www.cs.cmu.edu/~webkb/) for this experiment. The WebKB dataset is a collection of web pages obtained from four universities, namely Cornell, Texas, Washington, and Wisconsin, as well as from the computer science departments of other universities. The dataset is often used for document classification and document clustering Nigam et al. (2000); Bouguila (2008). The total number of documents is . Each document is classified into one of seven classes: “student,” “faculty,” “staff,” “course,” “project,” “department,” and other. In this experiment, we used the classes “course,” “faculty,” “project,” and “student,” containing pages, i.e., the same classes as in Bouguila (2008). Note that “course” refers to pages about lectures, “faculty” and “student” are teachers’ or students’ homepages, and “project” refers to pages about research projects. The number of pages for each university and class is listed in Table 1.
We transformed the documents into a document–term matrix before clustering. Note that class labels were given to these documents in advance; therefore, the purpose of clustering in this experiment was to extract the characteristics of the documents relative to the given labels. In addition, we could not ignore the possibility that the documents contain hidden clusters that cannot be captured by the given labels.
We processed the dataset in the same way as Bouguila (2008): we selected the top 300 words as measured by average mutual information with the class variable, after removing stop words and words that occurred fewer than 50 times. This preprocessing was conducted using the Rainbow package McCallum (1996).
We ran SMLSOM with the multinomial distribution model. For comparison, we also ran EM for a mixture of multinomial distributions using the mixtools package (mixtools: Tools for Analyzing Finite Mixture Models. Available at: https://cran.r-project.org/web/packages/mixtools/index.html) Benaglia et al. (2009) in R. The settings for each method were as follows. The parameters were randomly initialized. SMLSOM was run 10 times with and a hexagonal lattice; we evaluated its 10 results by the MDL and selected the best of them. EM was applied 20 times for each from to , after which its results were evaluated by BIC/ICL and the best was selected.
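For concreteness, EM for a mixture of multinomial distributions with BIC-based selection of the number of components can be sketched as follows. This is a minimal illustration, not the mixtools implementation used above; the function names, the number of random restarts, and the smoothing constant are our own choices.

```python
import numpy as np
from scipy.special import logsumexp

def fit_once(X, k, n_iter=100, seed=0, eps=1e-10):
    """One EM run for a mixture of multinomials; X is an
    (n_docs, n_words) count matrix."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1.0 / k)                   # mixing proportions
    theta = rng.dirichlet(np.ones(d), size=k)  # per-component word probabilities
    ll = -np.inf
    for _ in range(n_iter):
        # E-step; the multinomial coefficient is constant in the
        # parameters, so it is omitted (it also cancels across k).
        log_p = np.log(pi + eps) + X @ np.log(theta + eps).T  # (n, k)
        norm = logsumexp(log_p, axis=1)
        r = np.exp(log_p - norm[:, None])      # responsibilities
        ll = norm.sum()
        # M-step
        pi = r.mean(axis=0)
        theta = r.T @ X + eps
        theta /= theta.sum(axis=1, keepdims=True)
    return pi, theta, r, ll

def bic_for_k(X, k, n_restarts=5):
    """Best of several random restarts, scored by BIC (lower is better)."""
    n, d = X.shape
    best = max((fit_once(X, k, seed=s) for s in range(n_restarts)),
               key=lambda f: f[3])
    n_params = (k - 1) + k * (d - 1)           # free parameters
    return -2.0 * best[3] + n_params * np.log(n), best

def select_k(X, k_range):
    """Choose the number of components that minimizes BIC."""
    scores = {k: bic_for_k(X, k) for k in k_range}
    return min(scores, key=lambda k: scores[k][0]), scores
```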
We compared the ARI of the clustering results obtained by EM and SMLSOM with the abovementioned settings. Note that the ARI was calculated with respect to the true classes, i.e., the four classes mentioned above. SMLSOM selected , and the ARI of the selected clustering was 0.3216. The BIC/ICL evaluation of the results for EM is shown in Fig. 5. Both criteria indicated that was the best clustering result, for which an ARI of 0.2670 was obtained.
The value selected for EM by BIC/ICL is four times the number of given classes. However, the existence of such a large number of hidden classes in these data, in addition to the four given classes, is unlikely. Indeed, the ARI evaluation indicated that the result provided by EM was inferior to that of SMLSOM.
Next, we analyzed the clusters provided by SMLSOM. Table 2 lists the number of documents of each class in each cluster. The table shows that cluster A collected many of the “project” pages, cluster D collected many of the “course” pages, and clusters B, C, E, and F collected many of the homepages. In particular, clusters B, C, and F all contained both “faculty” and “student” pages, with cluster C also containing “project” pages.
To examine the characteristics of the clusters, the most frequent words in each cluster are listed in Table 3. According to the table, in cluster D, which collected the most “course” pages, short digit strings and a string denoting the time appear frequently. This is because words about office hours, lecture schedules, and course numbers appear frequently in course pages. In addition, words such as “homework” and “assignment” appear in this cluster but not in the others; these words characterize the “course” class. In cluster A, which collected many of the “project” pages, its own class label, “project(s),” is positioned high in the table, and words likely to relate to research topics in computer science, such as “system(s),” “software,” and “data,” also appear. In the remaining clusters B, C, E, and F, which collected many of the homepages, words such as “computer,” “science,” and “university” appear high in the table, because these words are used for self-introduction. Although clusters B, C, E, and F have some words in common, they also differ, as seen in the table. In cluster C, which collected “faculty” pages, the word “professor” appears in the table, so the pages in cluster C were presumably written by professors or by persons supervised by a professor. In addition, the word “research” appears high in the table, presumably because many persons in cluster C introduce their own research on their homepages. In clusters B and F, which collected “student” pages, the persons are likely to be graduate students because, as in cluster C, the word “research” appears high in the table. In particular, the persons in cluster F may be Ph.D. students who have their own office, because terms such as “office” and “:PHONENUM:” appear in the table.
Indeed, an investigation of the ratio of documents that include the term “PhD” produced the following result: A: 0.038, B: 0.123, C: 0.110, D: 0.006, E: 0.035, F: 0.113. This indicates that clusters B, C, and F contain more pages of Ph.D. students than cluster E. On the other hand, we examined the characteristics of cluster E by enumerating, among the 300 selected words, the 50 most frequent words in every cluster and extracting the words that appear only in the list for cluster E. In this way, the 14 words shown in Table 4 were obtained. They include words related to hobbies, such as “music” and “sports,” and expressions of preference, such as “cool” and “favorite.” This result suggests that many persons in cluster E prefer to introduce their own interests, such as hobbies and favorite things. The ratios of documents that include these words are provided in Table 5. Each of the words in the table appears in approximately of the documents in cluster E. After cluster E, cluster B contains the most documents that include these words; thus, the homepages in clusters E and B are similar in their purpose, which is to introduce the owners’ hobbies and favorite objects.
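The per-cluster ratios reported here and in Table 5 are simple document frequencies. They can be computed as follows (a small sketch; the function and variable names are our own):

```python
import numpy as np

def word_ratio_by_cluster(X, labels, word_idx):
    """Fraction of documents in each cluster that contain the word at
    column word_idx at least once; X is an (n_docs, n_words) count
    matrix and labels gives each document's cluster."""
    ratios = {}
    for c in np.unique(labels):
        docs = X[labels == c]
        ratios[c] = float(np.mean(docs[:, word_idx] > 0))
    return ratios
```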
The results can therefore be summarized as follows. Clusters B, C, E, and F, which collected the self-introduction pages classified as “faculty” or “student,” were divided into E versus B, C, and F, where E collected non-researchers’ pages and B, C, and F collected researchers’ pages. Clusters B and E contain many homepages that introduce the owners’ preferences. Therefore, the researchers in cluster B, like the students in cluster E, can be considered to prefer to introduce something of their own, such as a hobby or a favorite object. On the other hand, comparing cluster C with cluster F, the persons in cluster C may be professors or persons supervised by a professor, and the cluster contains many pages about their research, whereas the persons in cluster F may be Ph.D. researchers or students who have their own office and who provide their contact information on their homepages.
From the results presented above, we conclude that SMLSOM can find hidden clusters that cannot be captured by using only the four labels given in the documents.
6 Conclusion
In this paper, we proposed a greedy algorithm named SMLSOM to select the number of clusters . The algorithm is easy to implement and is applicable to any probability distribution model whose parameters can be estimated by the method of moments. SMLSOM constructs clusters by gradually decreasing , using the learning rule of SOM, and by updating the structure of the graph that attaches a probability distribution model to each node. We showed that, compared with existing methods, the dependence on initial values can be reduced and a model consistent with the purpose of clustering can be chosen.
The proposed method is fast, easy to implement, and performs the clustering task effectively. The experimental results confirmed that the proposed method achieves highly accurate and meaningful clustering and is computationally efficient, which indicates its applicability to various real-world tasks. We plan to further test its effectiveness on various types of real data.
Appendix A Initialization
Let ; the initial reference vector of node , , is calculated as
(33)
where and are the eigenvectors and eigenvalues of the 1st and 2nd principal components of , respectively. Further, constitute a sequence of numbers from to with a common difference; they are given by
(34)
(35)
Appendix B Multinomial model
Suppose $\mathbf{x} = (x_1, \ldots, x_d)^{\top}$, with $\sum_{j=1}^{d} x_j = n$, follows a multinomial distribution. The probability function is given by
(36) $p(\mathbf{x} \mid \boldsymbol{\theta}) = \dfrac{n!}{\prod_{j=1}^{d} x_j!} \prod_{j=1}^{d} \theta_j^{x_j},$
where $\theta_j \ge 0$ and $\sum_{j=1}^{d} \theta_j = 1$. The 1st moment of the multinomial distribution is given by
(37) $E[x_j] = n\theta_j, \quad j = 1, \ldots, d.$
The procedure to update the parameters for the multinomial model in MLSOM is expressed as follows, based on Eq. (22):
(38) 
Note that, if , let
Appendix C Adjusted Rand Index
Let $S = \{1, \ldots, n\}$ be the set of indices of the samples. Let $U = \{U_1, \ldots, U_R\}$ and $V = \{V_1, \ldots, V_C\}$ be two different partitions of $S$, where the $U_r$ and $V_c$ are subsets of $S$ and satisfy the following: $\bigcup_{r} U_r = \bigcup_{c} V_c = S$, $U_r \cap U_{r'} = \emptyset$ for $r \neq r'$, and $V_c \cap V_{c'} = \emptyset$ for $c \neq c'$.
Considering every sample pair $(i, j)$, $i < j$, count
(39) $a$: the number of pairs whose members are in the same subset in both $U$ and $V$,
(40) $b$: the number of pairs whose members are in the same subset in $U$ but in different subsets in $V$,
(41) $c$: the number of pairs whose members are in different subsets in $U$ but in the same subset in $V$,
(42) $d$: the number of pairs whose members are in different subsets in both $U$ and $V$;
then the Rand index (RI) is given by
(43) $\mathrm{RI} = \dfrac{a + d}{a + b + c + d}.$
The ARI is then defined as
(44) $\mathrm{ARI} = \dfrac{\mathrm{RI} - E[\mathrm{RI}]}{\max(\mathrm{RI}) - E[\mathrm{RI}]},$
where $E[\mathrm{RI}]$ is the expected value of the RI when the two partitions $U$ and $V$ are independent, given by
(45) $E[\mathrm{RI}] = 1 - \dfrac{P + Q}{M} + \dfrac{2PQ}{M^2}, \quad M = \binom{n}{2},\; P = a + b,\; Q = a + c.$
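The ARI can be computed efficiently from the contingency table of the two partitions rather than by enumerating all sample pairs. A sketch in Python (function names are our own):

```python
import numpy as np

def comb2(x):
    """Number of unordered pairs, x choose 2 (vectorized)."""
    return x * (x - 1) / 2.0

def adjusted_rand_index(u, v):
    """ARI of two label vectors, computed from their contingency table;
    equivalent to the pair-counting definition in Appendix C."""
    u, v = np.asarray(u), np.asarray(v)
    n = len(u)
    _, ui = np.unique(u, return_inverse=True)
    _, vi = np.unique(v, return_inverse=True)
    table = np.zeros((ui.max() + 1, vi.max() + 1))
    np.add.at(table, (ui, vi), 1)
    sum_ij = comb2(table).sum()               # pairs together in both (a)
    sum_a = comb2(table.sum(axis=1)).sum()    # pairs together in U (a + b)
    sum_b = comb2(table.sum(axis=0)).sum()    # pairs together in V (a + c)
    expected = sum_a * sum_b / comb2(n)
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)
```

Note that the ARI is invariant to relabeling of the clusters: two partitions that group the samples identically obtain an ARI of 1 regardless of the label names used.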
References
 A new look at the statistical model identification. IEEE Transactions on Automatic Control 19 (6), pp. 716–723. Cited by: §1, item 1.
 mixtools: An R package for analyzing finite mixture models. Journal of Statistical Software 32 (6), pp. 1–29. Cited by: §5.3.
 Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (7), pp. 719–725. Cited by: §1, item 3.
 Probabilistic topic models. Communications of the ACM 55 (4), pp. 77–84. Cited by: §5.
 Clustering of count data using generalized Dirichlet multinomial distributions. IEEE Transactions on Knowledge and Data Engineering 20 (4), pp. 462–474. Cited by: §5.3, §5.3.
 Model-based clustering of high-dimensional data: A review. Computational Statistics & Data Analysis 71, pp. 52–78. Cited by: §2.2.
 Multivariate mixtures of normals with unknown number of components. Statistics and Computing 16 (1), pp. 57–68. Cited by: §5.1.
 Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38. Cited by: §1.

Clustering: a neural network approach. Neural Networks 23 (1), pp. 89–107. External Links: ISSN 08936080 Cited by: §1.
 PG-means: learning the number of clusters in data. In Advances in Neural Information Processing Systems, pp. 393–400. Cited by: §1.
 Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (3), pp. 381–396. Cited by: §1, §5.2, §5.
 Model-based methods of classification: using the mclust software in chemometrics. Journal of Statistical Software 18 (6), pp. 1–13. Cited by: §2.2, §5.1, §5.
 Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97 (458), pp. 611–631. Cited by: §2.2, §2.2, §3.1, §5.1.
 MCLUST version 4: an R package for normal mixture modeling and model-based clustering. Technical Report 597, Department of Statistics, University of Washington. Cited by: §5.
 Vector quantization. IEEE Assp Magazine 1 (2), pp. 4–29. Cited by: §2.1.1.
 Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82 (4), pp. 711–732. Cited by: §1.

Learning the k in k-means. In Advances in Neural Information Processing Systems, pp. 281–288. Cited by: §1.
 Model selection and the principle of minimum description length. Journal of the American Statistical Association 96 (454), pp. 746–774. Cited by: §3.2.
 The dip test of unimodality. The Annals of Statistics, pp. 70–84. Cited by: §1.
 Comparing partitions. Journal of Classification 2 (1), pp. 193–218. External Links: ISSN 14321343 Cited by: §5.2.
 Extended k-means with an efficient estimation of the number of clusters. In International Conference on Intelligent Data Engineering and Automated Learning, pp. 17–22. Cited by: §5.2, §5.
 Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31 (8), pp. 651–666. Cited by: §1.
 Dip-means: an incremental clustering method for estimating the number of clusters. In Advances in Neural Information Processing Systems, pp. 2393–2401. Cited by: §1.
 Self-organized formation of topologically correct feature maps. Biological Cybernetics 43 (1), pp. 59–69. Cited by: §1, §2.1.1.
 Self-Organizing Maps. 3rd edition, Springer. Cited by: §2.1.1.
 A new R package for Bayesian estimation of multivariate normal mixtures allowing for selection of the number of components and interval-censored data. Computational Statistics & Data Analysis 53 (12), pp. 3932–3947. Cited by: §5.1.
 On information and sufficiency. The Annals of Mathematical Statistics 22 (1), pp. 79–86. Cited by: §1, §4.2.1.
 ‘Neural-gas’ network for vector quantization and its application to time-series prediction. IEEE Transactions on Neural Networks 4 (4), pp. 558–569. Cited by: §1, §2.1.2, §2.1.2.
 Bow: a toolkit for statistical language modeling, text retrieval, classification and clustering. External Links: Link Cited by: §5.3.

Mixture models for standard p-dimensional Euclidean data. In Handbook of Cluster Analysis, C. Hennig, M. Meila, F. Murtagh, and R. Rocci (Eds.), Chapman & Hall/CRC Handbooks of Modern Statistical Methods, pp. 145–172. Cited by: §2.2, §2.2, §3.1.
 On the number of components in a Gaussian mixture model. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4 (5), pp. 341–355. External Links: ISSN 19424787 Cited by: item 3, §2.3.
 MixSim: An R package for simulating data to study performance of clustering algorithms. Journal of Statistical Software 51 (12), pp. 1–25. Cited by: §5.2, §5.
 Text classification from labeled and unlabeled documents using EM. Machine Learning 39 (2–3), pp. 103–134. Cited by: §5.3.
 X-means: extending k-means with efficient estimation of the number of clusters. In ICML, Vol. 1, pp. 727–734. Cited by: §1.
 On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59 (4), pp. 731–792. Cited by: §1, §5.1.
 Modeling by shortest data description. Automatica 14 (5), pp. 465–471. Cited by: §1, §3.2.
 A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407. Cited by: §4.1.
 A review of clustering techniques and developments. Neurocomputing 267, pp. 664–681. Cited by: §1.
 Estimating the dimension of a model. The Annals of Statistics 6 (2), pp. 461–464. Cited by: §1, item 2.
 Elements of information theory. 2nd edition, John Wiley & Sons. Cited by: §3.2.1.
 Clustering of the self-organizing map. IEEE Transactions on Neural Networks 11 (3), pp. 586–600. Cited by: §2.1.1.
 Estimation and inference by compact coding. Journal of the Royal Statistical Society. Series B (Methodological), pp. 240–265. Cited by: §1.
 An information measure for classification. The Computer Journal 11 (2), pp. 185–194. Cited by: §1.