The growth of large-scale datasets and diversity of data brings an urgency to the development of analytic methods that can handle high volume and dimensionality as well as data that include a mixture of categorical and numerical attributes. Approaching analysis from a probabilistic perspective wherein data is represented as a high-dimensional mixture model provides a transparent representation and a tool that supports common operations such as clustering, anomaly detection, and classification.
While mixture models are a powerful tool, they are often employed for numerical data where mathematical functions, such as multivariate Gaussians, can be used. Each mixture component concisely captures the contribution of a dense region in the high-dimensional space to the distribution as a whole.
In this paper, we present a new algorithm, the GenerALIzed Low-EntrOpy mixture model (GALILEO), to extend mixture models to categorical attribute space using a new definition of component density that applies to categorical data. Our concept of categorical density remediates the lack of a natural distance metric in categorical space [boriah2008similarity] and contributes to building mixture models with high-density components that represent natural clusters.
The proposed approach involves starting with a high number of initial components and using an annealing process to iteratively remove low-density components. This procedure results in high-density/low-entropy distributions that accurately fit the data. In each step of the process, an expectation-maximization (EM) algorithm is used to generate a fit to the data; pruning of low-density components is then performed using an entropy-based density metric.
We demonstrate that this process generates an optimal solution with respect to the density metric for the mushroom dataset and produces results comparable to the state of the art on other datasets commonly used in the literature.
GALILEO is easily parallelizable and, for a fixed number of mixture components, scales linearly with the number of data points, making it suitable for use on large datasets. Implementation and testing of the algorithm have been done on SOCRATES, a scalable analytics platform developed at JHU/APL [savkli2014socrates].
This paper is organized as follows: In the next section, we introduce the concept of a generalized density metric that provides the key ingredient of the algorithm. In Section III, we present a generalization of the mixture model that leverages the density metric. Section IV describes the procedure for determining the optimal number of clusters. We then review similar algorithms in Section V. In Section VI, we present test results on various commonly used datasets.
II A Generalized Density Metric
One of the challenges in categorical space is the evaluation of the quality of a mixture component. In numerical space, a natural measure for the quality of a component is provided by the variance of the distribution; high-variance components represent sparsely populated regions of space.
When a mixture model is initialized with components far from high-density regions, the EM process steers the components towards regions with higher density to eventually find a reasonable solution. In categorical space, the EM process is hindered by a lack of analytic representation that can leverage features of the distribution. Combined with the lack of a component center and a universal distance metric, the EM process can lead to poor results by converging to sub-optimal distributions. To remedy this problem, we follow an approach that starts with a high number of components in the mixture model and uses a fitness criterion and pruning process to remove low-quality components.
The fitness criterion used in pruning of low-quality components is given by a generalized density metric. Consider, for example, the following one-dimensional distributions:
Whereas a naïve Cartesian density metric, defined as the number of particles per unit length, is identical for these two distributions, the distribution on the right is clearly not as “dense” as the distribution on the left; that is, the distribution on the right is more uniform than the one on the left. We therefore propose an effective length for the axis using the entropy of the distribution,
With this definition for the effective length, the length of these distributions is given by and , therefore the densities () are .
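The idea of an entropy-based effective length can be illustrated with a minimal sketch. It assumes that the effective length is the exponential of the Shannon entropy (natural logarithm) and that the density is the number of points divided by that length; the function names and the two toy samples are hypothetical and are not the distributions shown in the original figure.

```python
import math
from collections import Counter

def effective_length(values):
    """Assumed effective length: exp(H), with H the Shannon entropy in nats."""
    counts = Counter(values)
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return math.exp(entropy)

def entropy_density(values):
    """Assumed density: number of points divided by the effective length."""
    return len(values) / effective_length(values)

# Hypothetical data: 8 points over the same 4 bins in both cases.
peaked = ["a"] * 5 + ["b", "c", "d"]   # most of the mass in one bin
uniform = ["a", "b", "c", "d"] * 2     # mass spread evenly

print(entropy_density(peaked), entropy_density(uniform))
```

Both samples have the same Cartesian density (8 points over 4 bins), but under these assumptions the peaked sample receives the higher entropy-based density.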
As this simple example illustrates, the density definition indeed favors the left distribution. This definition of density (Eq. 1) applies to numerical data as well as categorical data. For example, in a Gaussian distribution, the exponentiation of the differential entropy $H$ is proportional to the standard deviation $\sigma$ of the distribution [huber2008entropy], i.e., $e^{H} = \sqrt{2\pi e}\,\sigma$.
It is possible to show that many distributions also have a similar relationship between standard deviation and entropy (e.g., $e^{H} = e\,\sigma$ for an exponential distribution and $e^{H} = \sqrt{2}\,e\,\sigma$ for a Laplace distribution). Extending the entropy-based effective length specification to higher dimensions, the entropy-based effective volume of a hyper-cube in attribute space, with attributes, can be defined as
which leads to a definition of a generalized density in higher dimensions,
Although it is possible to use the definition of density given in Eq. 4 for both categorical and numerical variables, the numerical subspace requires some care in how entropies are defined. If a multivariate distribution has a high degree of correlation between its variables, treating variables as independent leads to an over-estimation of the effective volume as off-diagonal regions are sparsely populated. Therefore, it is more appropriate to define the volume of the numerical subspace in terms of entropies along the principal axes defined by Principal Component Analysis (PCA) [pearson1901principal]. Looking to the relationship between entropy and standard deviation for guidance, the entropies of numerical subspace along principal components can be estimated using
represent eigenvalues of the covariance matrix for numerical attributes.
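The effect of correlation on the effective volume can be sketched numerically. The example below assumes the Gaussian relationship between entropy and variance, so that the entropy along each principal axis is estimated as half the logarithm of $2\pi e$ times the corresponding eigenvalue of the covariance matrix; this specific form, the function names, and the synthetic data are assumptions made for illustration. The input must have at least two numerical attributes.

```python
import numpy as np

def pca_entropies(X):
    """Gaussian-style entropy estimate along each principal axis (assumed form)."""
    cov = np.cov(X, rowvar=False)              # covariance of the numerical attributes
    eigvals = np.linalg.eigvalsh(cov)          # variances along the principal axes
    eigvals = np.clip(eigvals, 1e-12, None)    # guard against zero variance
    return 0.5 * np.log(2.0 * np.pi * np.e * eigvals)

def effective_volume(X):
    """Exponential of the summed principal-axis entropies."""
    return float(np.exp(pca_entropies(X).sum()))

# Hypothetical, strongly correlated two-attribute data.
rng = np.random.default_rng(0)
z = rng.normal(size=(5000, 1))
correlated = np.hstack([z, z + 0.1 * rng.normal(size=(5000, 1))])

# Volume obtained when the attributes are treated as independent.
variances = correlated.var(axis=0, ddof=1)
naive = float(np.exp((0.5 * np.log(2.0 * np.pi * np.e * variances)).sum()))

print(effective_volume(correlated), naive)
```

For correlated attributes, the volume computed along the principal axes is smaller than the volume obtained by treating the attributes as independent, which is the over-estimation described above.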
The definition of the entropy-based density implies that a uniform distribution leads to a density of 1, independent of the size and shape of the cube, for data without duplicates. Furthermore, in this case of data without duplicate points, the density is bounded above by 1, a constraint that follows from Shannon’s entropy inequality. For a cube of particles without duplicates, the joint entropy, , is given by
is the probability of an individual particle.
The entropy of a multivariate distribution follows the inequality
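The chain of reasoning behind this bound can be written out as follows. The notation is chosen here for illustration and is an assumption: $N$ is the number of particles, $d$ the number of attributes, $H(X_i)$ the entropy of attribute $i$, $V$ the effective volume taken as the exponential of the summed attribute entropies, and $\rho = N/V$ the density.

```latex
\begin{align}
H(X_1,\dots,X_d) &= -\sum_{n=1}^{N} \frac{1}{N}\ln\frac{1}{N} = \ln N, \\
H(X_1,\dots,X_d) &\le \sum_{i=1}^{d} H(X_i), \\
V = \exp\Big(\sum_{i=1}^{d} H(X_i)\Big) &\ge e^{\ln N} = N
\quad\Longrightarrow\quad \rho = \frac{N}{V} \le 1 .
\end{align}
```

Equality holds when the attributes are independent, as in a fully populated uniform cube.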
III Generalized Mixture Model
A mixture model is defined by a superposition of probability distributions, one for each of its components,
where each component distribution, , is subject to the normalization condition,
and the component priors determine the relative size of each component,
The individual component distributions can be modeled by any suitable distribution depending on the problem and types of data attributes involved. When the attributes are all categorical, a high-dimensional nonparametric distribution based on a clique tree may be used [savkli2016bayesian] to estimate the full joint probability. Such a distribution is a good option when data has sub-spaces where attributes are highly correlated. However, since individual component distributions are not required to model the entire space, but only a dense region, a complex structure such as a clique tree for individual components is not necessary. Correlations within dense regions are much less significant and a naïve assumption of attribute independence inside a component is typically sufficient. The fact that a mixture model comprises many components captures the structure of correlations that a clique tree represents. Therefore, the naïve probability for a data point with attributes to be a member of a component is given by
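The naïve product form described above can be sketched as follows. The data layout (one dictionary of value probabilities per attribute) and the names used are assumptions made for illustration.

```python
def component_probability(point, component):
    """Probability of a point under one component, assuming attribute independence."""
    prob = 1.0
    for attr, value in point.items():
        # Values never seen by the component get probability zero in this sketch.
        prob *= component[attr].get(value, 0.0)
    return prob

# Hypothetical component with two categorical attributes.
component = {
    "color": {"red": 0.75, "brown": 0.25},
    "odor": {"none": 0.5, "foul": 0.5},
}
point = {"color": "red", "odor": "none"}

print(component_probability(point, component))  # 0.375
```

In practice a small smoothing term is often added so that unseen values do not force the product to zero; whether GALILEO does so is not stated here.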
Each component contains a discrete probability distribution for each attribute of the dataset. Numerical attributes may be considered at this time by discretizing them and treating them as categorical. Alternatively, numerical subspaces can be represented using a multivariate distribution as is done in the Gaussian Mixture Model. However, our focus in this paper is the generalized categorical mixture model. The potential benefits of annealing using an entropy-based density metric for numerical data will be considered in future work.
GALILEO starts by initializing the mixture model with a large number of components, . The initialization of the components is performed by generating random component “centers” according to the global distribution of the data. Since initial components need a probability distribution (and a single center point does not provide that), we use an equally-weighted average of the global distribution with the randomly generated component center. In other words, each component starts with the probability distribution given by all data points plus a random center inserted a further times.
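One possible reading of this initialization is sketched below. It assumes that a component "center" is drawn attribute by attribute from the global marginal distributions and that the equally weighted average uses weights of one half each; both choices, and all names, are assumptions made for illustration.

```python
import random
from collections import Counter

def global_marginals(data):
    """Per-attribute value frequencies over the whole dataset."""
    n = len(data)
    attrs = list(data[0].keys())
    return {
        a: {v: c / n for v, c in Counter(row[a] for row in data).items()}
        for a in attrs
    }

def init_component(marginals, rng):
    """Blend the global marginals 50/50 with a randomly drawn center."""
    component = {}
    for attr, dist in marginals.items():
        values, weights = zip(*dist.items())
        center = rng.choices(values, weights=weights, k=1)[0]
        component[attr] = {
            v: 0.5 * p + (0.5 if v == center else 0.0) for v, p in dist.items()
        }
    return component

# Hypothetical toy dataset.
data = [
    {"color": "red", "odor": "none"},
    {"color": "brown", "odor": "foul"},
    {"color": "red", "odor": "foul"},
    {"color": "red", "odor": "none"},
]
rng = random.Random(0)
marginals = global_marginals(data)
components = [init_component(marginals, rng) for _ in range(3)]
```

Each resulting attribute distribution remains normalized, and every component is biased toward its own center while still assigning nonzero probability to every observed value.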
Following creation of the initial components, GALILEO will then iteratively:
1. Use expectation-maximization to fit the distribution to data at a given ,
2. Sort the mixture components using the density metric (Eq. 4),
3. Prune the lowest-density components,
until the number of components has been reduced to
. The EM algorithm in Step 1 evaluates component memberships in a probabilistic manner, assigning each data point fractionally to each component. This fractional assignment is given by the posterior probability of a measurement belonging to a component, which follows from Bayes’ theorem as
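The fractional assignment can be sketched with NumPy as follows, assuming the component likelihoods of each point have already been evaluated; the array layout and the function name are illustrative assumptions.

```python
import numpy as np

def responsibilities(likelihoods, priors):
    """Posterior membership of each point in each component via Bayes' theorem.

    likelihoods: shape (n_points, n_components), probability of point n under component k
    priors:      shape (n_components,), component priors
    """
    joint = likelihoods * priors                  # likelihood times prior
    evidence = joint.sum(axis=1, keepdims=True)   # total probability of each point
    return joint / evidence

# Hypothetical values for two points and two components.
likelihoods = np.array([[0.30, 0.10],
                        [0.05, 0.20]])
priors = np.array([0.5, 0.5])
resp = responsibilities(likelihoods, priors)
print(resp)
```

Each row of the result sums to one, so every data point is distributed fractionally over the components.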
An optimal solution for is then selected using an optimality criterion as described in Section IV.
A detailed description of the steps of the algorithm is provided in Fig. 4.
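The sort-and-prune step can be sketched as follows. The sketch assumes that the density of a component is its size divided by the exponential of its summed per-attribute entropies and that the priors are renormalized after pruning; these details and all names are assumptions, not a transcription of Fig. 4.

```python
import math

def component_density(component, size):
    """Assumed density: size / exp(sum of per-attribute entropies)."""
    log_volume = 0.0
    for dist in component.values():
        log_volume -= sum(p * math.log(p) for p in dist.values() if p > 0.0)
    return size / math.exp(log_volume)

def prune_lowest(components, priors, n_points, n_prune):
    """Drop the n_prune lowest-density components and renormalize the priors."""
    scored = sorted(
        zip(components, priors),
        key=lambda cp: component_density(cp[0], cp[1] * n_points),
        reverse=True,
    )
    kept = scored[: len(scored) - n_prune]
    total = sum(p for _, p in kept)
    return [c for c, _ in kept], [p / total for _, p in kept]

# Hypothetical components: one concentrated, one diffuse.
components = [
    {"color": {"red": 1.0}, "odor": {"none": 0.9, "foul": 0.1}},
    {"color": {"red": 0.5, "brown": 0.5}, "odor": {"none": 0.5, "foul": 0.5}},
]
priors = [0.5, 0.5]
kept, new_priors = prune_lowest(components, priors, n_points=100, n_prune=1)
```

With equal sizes, the diffuse component has the larger effective volume and is the one removed.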
This method is similar to the finite mixture model used by [meek2002learning, cheeseman1988bayesian], with further details in [titterington1985statistical, mclachlan1988mixture]. However, the introduction of the density metric and the procedure for optimizing make GALILEO a unique application of this model.
IV Search for Optimal
There are general rules of thumb about the relationship between the optimal number of components, , and the number of data points, . However, in general, it is not possible to make a definitive statement about such a relationship. From our experiments, we find that should be picked such that it is at least twice as large as the expected optimal number of components, . This choice gives the annealing process the opportunity to converge to the optimal solution consistently. Larger values of will not affect the value of , but will take longer to converge due to more steps being required.
In practice, we use the following procedure to step down from . By choosing a parameter, , we then inspect the set of values that are defined by the relationship,
where and . The parameter determines how finely the optimization of number of components is performed. Using such a rule, the number of possible mixture models inspected scales as .
For each value of , an EM procedure is performed to converge to a solution using available components at the level. Next, the quality of the mixture model solution is measured. Two commonly used metrics for model selection are the Akaike Information Criterion (AIC, [akaike1998information]) and Bayesian Information Criterion (BIC, [schwarz1978estimating]). These are given by
is the degrees of freedom of the model. A detailed description and comparison of these metrics is given by[vrieze2012model, aho2014model]. In addition to the AIC and BIC, we also evaluate the size-weighted average density of the components,
and is given by Eqn. (13) with and initially as we start with components. Whereas one would seek to minimize the AIC or BIC, we wish to maximize the density, , of the mixture model. The density measure, has the benefit over the AIC and BIC in that it scales only with the number of clusters, number of attributes, and cardinality of attributes – there is no dependence on the dataset size so it will be simpler and faster to compute than likelihood-based metrics. In later examples, we compare using each of these three criteria to determine , finding that they agree in certain cases. The choice of which to use may be data-dependent and is up to the user to choose.
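The three selection criteria can be sketched as follows. The AIC and BIC use their standard definitions; the size-weighted average density is assumed here to be the prior-weighted mean of the component densities, and the function names and numerical values are hypothetical.

```python
import math

def aic(log_likelihood, dof):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2.0 * dof - 2.0 * log_likelihood

def bic(log_likelihood, dof, n_points):
    """Bayesian Information Criterion: k ln N - 2 ln L."""
    return dof * math.log(n_points) - 2.0 * log_likelihood

def weighted_average_density(densities, priors):
    """Assumed form: component densities averaged with their relative sizes as weights."""
    return sum(w * d for w, d in zip(priors, densities)) / sum(priors)

# Hypothetical values for a single candidate model.
print(aic(-1200.0, 40))                                      # 2480.0
print(bic(-1200.0, 40, 1000))
print(weighted_average_density([1.0, 0.5], [0.75, 0.25]))    # 0.875
```

The AIC and BIC require the log-likelihood over all data points, whereas the density average needs only the component distributions and their priors.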
V Relevant Literature
To date, most work on clustering algorithms has focused on numerical data [aggarwal2013data, fahad2014survey]. However, some work has addressed the clustering of categorical and mixed data. In this respect, a handful of algorithms represent the state of the art, namely ROCK and COOLCAT. DBSCAN is a numerical clustering algorithm that uses a density notion similar to that of GALILEO. In this section, we briefly review each of these algorithms. In [Liang:2012], the authors present a review of the clustering literature and propose a different entropy-based method for determining the optimal clustering of mixed data. Due to space constraints, further comparisons with other algorithms are deferred to future work.
ROCK [guha2000rock] is often used as a benchmark for the quality of a categorical clustering algorithm. ROCK first computes the Jaccard coefficient between all pairs of data points. By then applying a threshold, , to these coefficients, ROCK assigns each data point a list of “neighbors” and computes the matrix , the number of common neighbors shared by points and . ROCK then agglomeratively finds clusters that maximize the criterion function,
where is a cluster fitness function chosen by the user that depends on the data and type of cluster desired. While ROCK has been shown to produce high-quality results, it suffers from a poor worst-case complexity of . Additionally, it requires the user to tune the algorithm to the data through the choice of both the thresholding parameter and the fitness function .
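The neighbor and link computation that ROCK starts from can be sketched as follows. Records are represented as sets of items, a point is not counted as its own neighbor, and the names `jaccard`, `link_counts`, and `theta` are illustrative choices; this is not the reference ROCK implementation, and the agglomerative merging step is omitted.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def link_counts(records, theta):
    """Number of common neighbors for every pair of records."""
    n = len(records)
    neighbors = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if jaccard(records[i], records[j]) >= theta:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return {
        (i, j): len(neighbors[i] & neighbors[j])
        for i, j in combinations(range(n), 2)
    }

# Hypothetical records.
records = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "e"}, {"x", "y", "z"}]
links = link_counts(records, theta=0.5)
print(links[(0, 1)], links[(0, 3)])  # 1 0
```

The all-pairs loop makes the quadratic cost in the number of records explicit, which is one source of ROCK's poor scaling on large datasets.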
COOLCAT [barbara2002coolcat] uses the notion of entropy as the means to cluster the data. The algorithm begins by selecting samples that collectively have the highest entropy. These points will be the initial cluster centers. COOLCAT then proceeds by adding each sample in the dataset to the cluster that will result in the smallest increase in entropy.
As a result of this sequential process, COOLCAT is sensitive to the ordering of the data. In order to limit this sensitivity, the data is processed in batches and a re-clustering procedure is performed after each batch. This procedure takes some fraction of the most poorly-fit data points and reassigns them to the clusters.
Even with the re-clustering procedure, COOLCAT results are strongly dependent on the ordering of the data. Moreover, the process of choosing the initial clusters is where is some representative sample of , limiting COOLCAT’s effectiveness for large datasets.
DBSCAN [ester1996density], much like GALILEO, uses a notion of density in order to find clusters of points. Unlike the methods covered to this point, DBSCAN is strictly for use on numerical data, requiring a distance metric to calculate distances between points in the data. The authors have even extended DBSCAN to cluster spatially extended objects like polygons [sander1998density]. The algorithm is able to automatically find the number of clusters as well as find clusters of arbitrary shape.
In this section, we present the results of GALILEO’s clustering on a few publicly available datasets. GALILEO clusters by assigning each data point to its most probable component in the optimal mixture model. We first describe the datasets to be used and then demonstrate GALILEO’s performance, including some comparisons to other algorithms mentioned previously.
VI-A Experimental Datasets
VI-A1 Congressional Votes
The Congressional votes dataset (https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records), votes, is from the UCI Machine Learning Repository [Lichman:2013]. This dataset consists of the 1984 voting history of each member of Congress with respect to different issues. Each member of Congress is assigned binary (yes/no) vote attributes as well as a classification label (Republican or Democrat). The dataset contains Democrats and Republicans. The classification label was ignored for the purpose of clustering so that it could be used as an independent measure of clustering results.
We have also benchmarked our code using the mushroom dataset (https://archive.ics.uci.edu/ml/datasets/Mushroom) from the UCI Repository [Lichman:2013]. This dataset contains the physical properties of gilled mushrooms from species in the Agaricus and Lepiota family, as well as their edibility. In addition to the binary edibility, there are other categorical attributes, each admitting up to twelve possible values. These attributes describe various properties such as color, odor, and shape. All attributes were used for clustering in order to be consistent with the procedure of the ROCK paper [guha2000rock].
Another standard categorical dataset, soybean (https://archive.ics.uci.edu/ml/datasets/Soybean+(Large)), consists of classes each with categorical attributes. This dataset categorizes the properties of various types of diseases in soybeans [michalski1980learning]. It was also obtained from the UCI Machine Learning Repository [Lichman:2013].
The zoo dataset (https://archive.ics.uci.edu/ml/datasets/Zoo) consists of different attributes related to each of species of animal. These attributes represent, for example, how many legs an animal has or if it has feathers. This dataset was also obtained from the UCI Machine Learning Repository [Lichman:2013].
In order to test data of various sizes, we used datgen [melli1999datgen] to generate categorical datasets of arbitrary size. These datasets were generated using a set of rules to cluster records in the attribute space. In our tests, each record had attributes with possible values constrained by one of five rules.
VI-B Experimental Results
In order to show in detail how the algorithm works, we use the mushroom dataset (Section VI-A2). Fig. 5 shows the AIC, BIC, and density curves produced by GALILEO when clustering this dataset. All three metrics agree that , although there are visible differences in how clear this selection is.
It is worth noting that for . Recall that for data that has no duplicates, the theoretical upper limit on the density of a cluster according to the inequality given by Eq. 8d is . Interestingly, this optimal solution corresponds to clusters that contain only all edible or all poisonous mushrooms. Furthermore, the number of clusters corresponds exactly to the number of species of mushrooms represented in the dataset (unfortunately, the species identification is not in the dataset, so we are unable to perform a direct comparison). Reaching the maximum average density of in a generic clustering problem when data is categorical is clearly not always achievable.
The role of density in obtaining this result can be understood by changing the pruning criteria from our entropy-based density to a naïve Cartesian density. Fig. 6 demonstrates that when a simplistic Cartesian density is used it is not possible to reach an optimal result, instead finding that . Whereas the results are comparable for high values of , the Cartesian density is less able to determine which clusters are best to prune as the number of clusters begins to approach .
Another consideration in the execution of the algorithm is the choice of . The results shown in Fig. 7 illustrate that as long as , the annealing process converges to the same optimal result, . If is set lower, the annealing process does not have sufficient time to converge to the optimal solution. We observe a similar behavior on other datasets tested and in general find that using a starting point that has at least twice the expected number of clusters is a good rule of thumb to reach an optimal solution.
VI-C Comparison to Other Algorithms
We now compare GALILEO’s results to those of other commonly used categorical clustering algorithms (Sec. V). In evaluating the quality of our clustering results, we make use of the Category Utility function [categoryutility, corter1992explaining, Fisher1987], . This function measures the predictive advantage gained from knowing the clustering, relative to not knowing it.
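A minimal sketch of the Category Utility computation is given below. It assumes the commonly used form in which, for each cluster, the sum of squared conditional value probabilities minus the sum of squared unconditional value probabilities is weighted by the cluster's relative size, and the total is divided by the number of clusters; the data layout and names are illustrative.

```python
from collections import Counter

def squared_prob_sum(rows, attrs):
    """Sum over attributes and values of the squared value frequency."""
    n = len(rows)
    return sum(
        (c / n) ** 2 for a in attrs for c in Counter(r[a] for r in rows).values()
    )

def category_utility(data, labels):
    """Category Utility of a hard clustering, normalized by the number of clusters."""
    n = len(data)
    attrs = list(data[0].keys())
    baseline = squared_prob_sum(data, attrs)
    clusters = Counter(labels)
    total = 0.0
    for k, size in clusters.items():
        members = [r for r, lab in zip(data, labels) if lab == k]
        total += (size / n) * (squared_prob_sum(members, attrs) - baseline)
    return total / len(clusters)

# Hypothetical toy data with two perfectly separated groups.
data = [
    {"color": "red", "odor": "none"},
    {"color": "red", "odor": "none"},
    {"color": "brown", "odor": "foul"},
    {"color": "brown", "odor": "foul"},
]
print(category_utility(data, [0, 0, 1, 1]))  # 0.5
print(category_utility(data, [0, 0, 0, 0]))  # 0.0
```

A clustering that carries no information about the attribute values scores zero, and larger values indicate clusters whose members share attribute values more consistently.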
Because ROCK implements an outlier removal scheme, its authors report fewer total members for each cluster than we do. While this comparison is not exact, we have demonstrated an improvement in the clustering when compared to the traditional centroid-based hierarchical clustering algorithm [duda1973pattern, jain1988algorithms] that ROCK used as a baseline.
|GALILEO||ROCK||Trad. Hier. Method|
On the mushroom dataset, GALILEO finds roughly the same clusters as ROCK, with the only exception being that GALILEO naturally converges to clusters as opposed to with ROCK [guha2000rock]. These extra clusters result from splitting two of the ROCK clusters, including the one with mixed edibility. GALILEO identifies no clusters with mixing in the edibility attribute.
Finally, in Table III we report our results for the various datasets from the UCI Machine Learning repository. It is noteworthy that these values show comparable results for to the results of COOLCAT (Sec. V-B), obtained using the coolcat-r package (https://github.com/clbustos/coolcat-r); the exception is the mushroom dataset, marked with –, whose value we obtain from [barbara2002coolcat] and normalize by an assumed clusters.
VI-D Scaling Results
In order to test the computational time complexity of our algorithm, we used synthetic categorical data (Section VI-A5) of various sizes, built using the same rules. For each dataset, GALILEO was able to find the known true and accurately cluster the data points. Fig. 8 shows the timing results of this test; multiple runs were performed for each value of , yielding highly consistent timings. Once the number of records reaches a certain threshold, our scaling is very close to the theoretical time complexity , for a fixed .
In this paper, we have presented a new method of generating mixture models in linear time for data with categorical attributes. The keys to this approach are the entropy-based density metric in categorical space and the annealing of high-entropy/low-density components from an initial state with many components. Pruning of low-density components using the entropy-based density allows GALILEO to consistently find high-quality clusters and the same optimal number of clusters. GALILEO has shown promising results on a range of test datasets commonly used for categorical clustering benchmarks. In particular, we have shown that GALILEO’s annealing approach and density-based pruning consistently find the optimal clustering (based on our concept of density) on the mushroom dataset. Perhaps more importantly, we have demonstrated that the scaling of GALILEO is linear in the number of records in the dataset, making this method suitable for very large categorical datasets. GALILEO can be naturally extended to include numerical attributes and datasets with mixed attribute types. In the future, we will expand the applications of this method to datasets consisting of mixed attributes and compare GALILEO’s performance on numerical data to traditional numerical clustering algorithms.
This work was supported by internal research and development funding provided by JHU/APL.