kmodes
Python implementations of the k-modes and k-prototypes clustering algorithms, for clustering categorical data
view repo
Many clustering algorithms exist that estimate a cluster centroid, such as K-means, K-medoids or mean-shift, but no algorithm seems to exist that clusters data by returning exactly K meaningful modes. We propose a natural definition of a K-modes objective function by combining the notions of density and cluster assignment. The algorithm becomes K-means and K-medoids in the limit of very large and very small scales. Computationally, it is slightly slower than K-means but much faster than mean-shift or K-medoids. Unlike K-means, it is able to find centroids that are valid patterns, truly representative of a cluster, even with nonconvex clusters, and appears robust to outliers and misspecification of the scale and number of clusters.
READ FULL TEXT VIEW PDFPython implementations of the k-modes and k-prototypes clustering algorithms, for clustering categorical data
Python implementations of the k-modes and k-prototypes clustering algorithms, for clustering categorical data
Python implementations of the k-modes and k-prototypes clustering algorithms, for clustering categorical data
We maximize the objective function
(4) | |||
For a given assignment , this can be seen as (proportional to) the sum of a kde as in eq. (2) but separately for each cluster. Thus, a good clustering must move centroids to local modes, but also define separate kdes. This naturally combines the idea of clustering through binary assignment variables with the idea that high-density points are representative of a cluster (for suitable bandwidth values).
As a function of the bandwidth , the -modes objective function has two interesting limit cases. When , it becomes -means. This can be seen from the centroid update (which becomes the mean), or from the objective function directly. Indeed, approximating it with Taylor’s theorem for very large and using the fact that gives
where is the same as in eq. (1) and is subject to the same constraints. Thus, maximizing becomes minimizing , exactly the -means problem. When , it becomes a -medoids algorithm, since the centroids are driven towards data points. Thus,
-modes interpolates smoothly between these two algorithms, creating a continuous path that links a
-mean to a -medoid. However, its most interesting behavior is for intermediate .As is the case for -means and -medoids, minimizing the -modes objective function is NP-hard. We focus on iterative algorithms that find a locally optimum clustering in the sense that no improvement is possible on the centroids given the current assignments, and vice versa. We give first an algorithm for fixed and then use it to construct a homotopy algorithm that sweeps over a interval.
It is convenient to use alternating optimization:
Over assignments for fixed , the constrained problem separates into a constrained problem for each point , of the form
with . The solution is given by assigning point to its closest centroid in Euclidean distance (assuming the kernel is a decreasing function of the Euclidean distance).
Over centroids for fixed , we have a separate unconstrained maximization for each centroid, of the form
which is proportional to the cluster kde, and can be done with mean-shift. Note the step over need not be exact, i.e., the centroids need not converge to their corresponding modes. We exit when a tolerance is met or mean-shift iterations have been run.
Thus, the algorithm operates similarly to -means but finding modes instead of means: it interleaves a hard assignment step of data points to centroids with a mode-finding step that moves each centroid to a mode of the kde defined by the points currently assigned to it.
Convergence of this algorithm (in value) follows from the facts that each step (over or over ) is strictly feasible and decreases the objective or leaves it unchanged, and that the objective function is lower bounded by 0 within the feasible set. Besides, since there is a finite number of assignments, convergence occurs in a finite number of outer-loop steps (as happens with -means) if the step over is exact and deterministic. By this we mean that for each we find deterministically a maximum of its objective function (i.e., the mode for is a deterministic function of ). This prevents the possibility that for the same assignment we find different modes for a given , which could lead the algorithm to cycle. This condition can be simply achieved by using an optimization algorithm that either has no user parameters (such as step sizes; mean-shift is an example), or has user parameters set to fixed values, and running it to convergence. The convergence point is a local maximum in the sense that has a local maximum at and has a global maximum at .
The computational cost per outer-loop iteration of this algorithm (setting for simplicity in the mean-shift step) is identical to that of -means: the step over is and the step over is (where is the number of points currently assigned to ), for a total of . And also as in -means, the steps parallelize: over , the mean-shift iteration proceeds independently in each cluster; over , each data point can be processed independently.
We start with (i.e., run -means, possibly several times and picking the best optimum). Then, we gradually decrease while running iterations of the fixed- -modes algorithm for each value of , until we reach a target value . This follows an optimum path for . In practice, as is well known with homotopy techniques, this tends to find better optima than starting directly at the target value . We use this homotopy algorithm in our experiments. Given we have to run -means multiple times to find a good initial optimum (as commonly done in practice), the homotopy does not add much computation. Note that the homotopy makes -modes a deterministic algorithm given the local optimum found by -means.
The basic user parameter of -modes is the desired number of clusters . The target bandwidth in the homotopy is simply used as a scaling device to refine the centroids. We find that representative, valid centroids are obtained for a wide range of intermediate values. A good target can be obtained with a classical bandwidth selection criterion for kernel density estimates Wand and Jones (1994), such as the average distance to the th nearest neighbor.
Practically, a user will typically be interested in the centroids and clusters resulting for the target bandwidth. However, examining the centroid paths can also be interesting for exploratory analysis of a dataset, as illustrated in our experiments with handwritten digit images.
-modes is most closely related to -means and to Gaussian mean-shift (GMS), since it essentially introduces the kernel density estimate into the -means objective function. This allows -modes to find exactly true modes in the data (in its mathematical sense, i.e., maxima of the kde for each cluster), while achieving assignments as in -means, and with a fast runtime, thus enjoying some of the best properties from both -means and GMS.
-means and -modes have the same update step for the assignments, but the update step for the centroids is given by setting each centroid to a different location statistic of the points assigned to it: the mean for -means, a mode for -modes. -means and -modes also define the same class of clusters (a Voronoi tessellation, thus convex clusters), while GMS can produce possibly nonconvex, disconnected clusters.
In GMS, the number of clusters equals the number of modes, which depends on the bandwidth . If one wants to obtain exactly modes, there are two problems. The first one is computational: since is an implicit, nonlinear function of , finding a value that produces modes requires inverting this function. This can be achieved numerically by running mean-shift iterations while tracking as in scale-space approaches Collins (2003)
, but this is very slow. Besides, particularly for high-dimensional data, the kde only achieves
modes for a very narrow (even empty) interval of . The second problem is that even with an optimally tuned bandwidth, a kde will usually create undesirable, spurious modes where data points are sparse (e.g. outliers or cluster boundaries), again particularly with high-dimensional data. We avoid this problem in the homotopy version of -modes by starting with large , which tracks important modes. The difference between -modes and GMS is clearly seen in the particular case where we set (as in fig. 1): -modes runs the mean-shift update initialized from the data mean, so as decreases, this will tend to find a single, major mode of the kde. However, the kde itself will have many modes, all of which would become clusters under GMS.The fundamental problem in GMS is equating modes with clusters. The true density of a cluster may well be multimodal to start with. Besides, in practice a kde will tend to be bumpy unless the bandwidth is unreasonably large, because it is by nature a sum of bumpy kernels centered at the data points. This is particularly so with outliers (which create small modes) or in high dimensions. There is no easy way to smooth out a kde (increasing the bandwidth does smooth it, but at the cost of distorting the overall density) or to postprocess the modes to select “good” ones. One has to live with the fact that a good kde will often have multiple modes per cluster.
-modes provides one approach to this problem, by separating the roles of cluster assignment and of density. Each cluster has its own kde, which can be multimodal, and the homotopy algorithm tends to select an important mode among these within each cluster. This allows -modes to achieve good results even in high-dimensional problems, where GMS fails.
Computationally, -modes and -means are per iteration for a dataset of points in dimensions. While -modes in its homotopy version will usually take more iterations, this extra runtime is small because in practice one runs -means multiple times from different initializations to achieve a better optimum. GMS is per iteration, which is far slower, particularly with large datasets. The reason is that in GMS the kde involves all points and one must run mean-shift iterations started from each of the points. However, in -modes the kde for cluster involves only the points assigned to it and one must run mean-shift iterations only for the centroid . Much work has addressed approximating GMS so that it runs faster, and some of it could be applied to the mean-shift step in -modes, such as using Newton or sparse EM iterations Carreira-Perpiñán (2006).
In addition to these advantages, our experiments show that -modes can be more robust than -means and GMS with outliers and with misspecification of either or .
There are two variations of mean-shift that replace the local mean step of eq. (3) with a different statistic: the local (Tukey) median Shapira et al. (2009) and a medoid defined as any dataset point which minimizes a weighted sum of squared distances Sheikh et al. (2007). Both are really medoid algorithms, since they constrain the centroids to be data points, and do not find true modes (maxima of the density). In general, -medoid algorithms such as -centers or -medians are combinatorial problems, typically NP-hard Hochbaum and Shmoys (1985); Kaufman and Rousseeuw (1990); Meyerson et al. (2004). In the limit , -modes can be seen as a deterministic annealing approach to a -medoids objective (just as the elastic net Durbin and Willshaw (1987) is for the traveling salesman problem).
There exists another algorithm called “-modes” Huang (1998); Chaturvedi et al. (2001). This is defined for categorical data and uses the term “mode” in a generic sense of “centroid”. It is unrelated to our algorithm, which is defined for continuous data and uses “mode” in its mathematical sense of density maximum.
We compare with -means and Gaussian mean-shift (GMS) clustering. For -means, we run it 20 times with different initializations and pick the one with minimum value of in eq. (1). For -modes, we use its homotopy version initialized from the best -means result and finishing at a target bandwidth (whose value is set either by using a kde bandwidth estimation rule or by hand, depending on the experiment).
Figures 2 and 3 illustrate the three algorithms in 2D examples. They show the modes and the kde contours for each cluster, for or equivalently -means (left panel) and for an intermediate (right panel). We run -modes decreasing geometrically in steps from to in fig. 2 and from to in fig. 3.
sigma=10[b][] (-means) | sigma=1[b][] |
sigma=10[b][] (-means) | sigma=0.1[b][] |
In fig. 2, which has 3 Gaussian clusters, we purposefully set (both -means and -modes work well with ). This makes -means put one of the centroids in a low-density area, where no input patterns are found. -modes moves the centroid inside a cluster in a maximum-density area, where many input patterns lie, and is then more representative.
In fig. 3, the “two-moons” dataset has two nonconvex, interleaved clusters and we set . The “moons” cannot be perfectly separated by either -means or -modes, since both define Voronoi tessellations. However, -modes does improve the clusters over those of -means, and as before it moves the centroids from a region where no patterns are found to a more typical location within each cluster. Note how, although the bandwidth used () yields a very good kde for each cluster and would also yield a very good kde for the whole dataset, it results in multiple modes for each “moon”, which means that GMS would return around 13 clusters. In this dataset, no value of results in two modes that separate the moons.
One might argue that, if a -means centroid is not a valid pattern, one could simply replace it with the data point from its cluster that is closest to it. While this sometimes works, as would be the case in the rotated-digit-1 of fig. 1, it often does not: the same-cluster nearest neighbor could be a point on the cluster boundary, therefore atypical (fig. 2) or even a point in the wrong cluster (fig. 3). -modes will find points interior to the clusters, with higher density and thus more typical.
We construct an undirected graph similar to many real-world graphs and estimate the distribution of the degree of each vertex Newman (2010). To construct the graph, we generated a random (Erdős-Rényi) graph (with vertices and edges), which has a Gaussian degree distribution, and a graph with a power-law (long-tailed) distribution (with vertices and edges), and then took the union of both graphs and added a few edges at random connecting the two subgraphs. The result is a connected graph with two types of vertices, reminiscent of real-world networks such as the graph of web pages and their links in the Internet. Thus, our dataset has points in 1D (the degree of each vertex). As shown in fig. 4
, the degree distribution is a mixture of two distributions that are well-separated but have a very different character: a Gaussian and a skewed, power-law distribution. The latter results in a few vertices having a very large degree (e.g. Internet hubs), which practically appear as outliers to the far right (outside the plots).
We set . -means obtains a wrong clustering. One centroid is far to the right, in a low-density (thus unrepresentative) region, and determines a cluster containing the tail of the power-law distribution; this is caused by the outliers. The other centroid is on the head of the power-law distribution and determines a cluster containing the Gaussian and the head of the power-law distribution.
We run -modes decreasing from to geometrically in steps. -modes shifts the centroids to the two principal modes of the distributions and achieves a perfect clustering. Note that the kde for the power-law cluster has many modes, but -modes correctly converges to the principal one.
GMS cannot separate the two distributions for any value of . Setting small enough that the kde has the two principal modes implies it also has many small modes in the tail because of the outliers (partly visible in the second panel). This is a well-known problem with kernel density estimation.
histogram | per-cluster kde | whole-data kde |
---|---|---|
sigma=200[b][] histogram[][t] (-means) degree | kde degree | kde degree |
sigma=29.8551[b][] histogram[][t] degree | degree | degree |
sigma=15.1361[b][] histogram[][t] degree | degree | degree |
sigma=3.8905[b][] histogram[][t] degree | degree | degree |
sigma=1[b][] histogram[][t] degree[t][]degree | sigma=1[b][] degree[t][]degree | degree[t][]degree |
We selected 100 random images ( grayscale) from the USPS dataset for each digit 0–9. This gives a dataset of points in . We ran -means and -modes with , decreasing from to geometrically in steps.
Fig. 5 shows that most of the centroids for -means are blurry images consisting of an average of digits of different identity and style (slant, thickness, etc.), as seen from the 20 nearest-neighbor images of each centroid (within its cluster). Such centroids are hard to interpret and are not valid digit images. This also shows how the nearest neighbor to the centroid may be an unusual or noisy input pattern that is not representative of anything except itself.
-modes unblurs the centroids as decreases. The class histograms for the 20 nearest-neighbors show how the purity of each cluster improves: for -means most histograms are widely distributed, while -modes concentrates the mass into mostly a single bin. This means -modes moves the centroids onto typical regions that both look like valid digits, and are representative of their neighborhood. This can be seen not just from the class labels, but also from the style of the digits, which becomes more homogeneous under -modes (e.g. see cluster , containing digit-6 images, or and , containing digit-0 images of different style).
Stopping -modes at an intermediate (preventing it from becoming too small) achieves just the right amount of smoothing. It allows the centroids to look like valid digit images, but at the same time to average out noise, unusual strokes or other idiosyncrasies of the dataset images (while not averaging digits of different identities or different styles, as -means does). This yields centroids that are more representative even than individual images of the dataset. In this sense, -modes achieves a form of intelligent denoising similar to that of manifold denoising algorithms Wang and Carreira-Perpiñán (2010).
Note that, for -modes, centroids and look very similar, which suggests one of them is redundant (while none of the -means centroids looked very similar to each other). Indeed, removing and rerunning -modes with simply reassigns nearly all data points in the cluster of to that of and the centroid itself barely changes. This is likely not a casuality. If we have a single Gaussian cluster but use , it will be split into sectors like a pie, but in -means the centroids will be apart from each other, while in -modes they will all end up near the Gaussian center, where the mode of each kde will lie. This suggests that redundancy may be easier to detect in -modes than in -means.
GMS with gives exactly modes, however of these one is a slanted-digit-1 cluster like in -modes and contains % of the training set points, and the remaining 9 modes are associated with clusters containing between 1 and 4 points only, and their centroids look like digits with unusual shapes, i.e., outliers. As noted before, GMS is sensitive to outliers, which create modes at nearly all scales. This is particularly so with high-dimensional data, where data is always sparse, or with data lying on a low-dimensional manifold (both of which occur here). In this case, the kde changes from a single mode for large to a multiplicity of modes over a very narrow interval of .
sigma=Inf[][l]-modes (-means) |
sigma=1[][l]-modes |
sigma=1.8369[][l]GMS |
Finally, we report clustering statistics in datasets with known pattern class labels (which the algorithms did not use): (1) COIL–20, which contains grayscale images of 20 objects viewed from varying angles. (2) MNIST, which contains grayscale handwritten digit images (we randomly sample 200 of each digit). And (3) the NIST Topic Detection and Tracking (TDT2) corpus, which contains on-topic documents of different semantic categories (we removed documents appearing in more than one category and kept only the largest 30 categories).
For each algorithm, we compare its clustering with the ground-truth one using two commonly used criteria for evaluating clustering results: the Adjusted Rand Index and the Normalized Mutual Information Manning et al. (2008). The results appear in table 1. For -means, we show the best result of 20 random initializations. -modes was initialized from this -means result and run by homotopy to a target bandwidth . We show two results for -modes, each with a different bandwidth value. The one in parentheses corresponds to a target estimated as the average distance of each point to its th nearest neighbor (a commonly used bandwidth estimation rule). The other one (not in parentheses) corresponds to the best result for , i.e., we enlarge a bit the interval of bandwidths for the homotopy. For GMS, we select to give exactly the desired modes (which, as before, is cumbersome). The best results for each dataset are in boldface. -modes improves over -means if using a bandwidth estimated automatically, but it improves even more if searching further bandwidths. GMS gives poor results for the reasons described earlier.
Adjusted Rand Index (%) | Normalized Mutual Info. (%) | |||||
---|---|---|---|---|---|---|
Dataset | -means | -modes | GMS | -means | -modes | GMS |
COIL | 56.5 | 62.1 (62.1) | 11.6 | 76.8 | 79.1 (79.1) | 49.8 |
MNIST | 32.9 | 35.5 (34.4) | 1.37 | 46.4 | 49.2 (47.4) | 11.7 |
TDT2 | 55.8 | 56.1 (56.1) | N/A | 80.0 | 80.8 (80.7) | N/A |
The previous experiments suggest that -modes is more robust than -means and GMS to outliers and parameter misspecification ( or ). Outliers shift centroids away from the main mass of a cluster in -means or create spurious modes in GMS, but -modes is attracted to a major mode within each cluster. GMS is sensitive to the choice of bandwidth, which determines the number of modes in the kde. However, -modes will return exactly modes (one per cluster) no matter the value of the bandwidth, and whether the kde of the whole dataset has more or fewer than modes. -means is sensitive to the choice of : if it is smaller than the true number of clusters, it may place centroids in low-density regions between clusters (which are invalid patterns); if it is larger than the true number of clusters, multiple centroids will compete for a cluster and partition it, yet the resulting centroids may show no indication that this happened. With -modes, if is too small the centroids will move inside the mass of each cluster and become valid patterns. If is too large, centroids from different portions of a cluster may look similar enough that their redundancy can be detected.
While -modes is a generic clustering algorithm, an important use is in appplications where one desires representative centroids in the sense of being valid patterns, typical of their cluster, as described earlier. By making small enough, -modes can always force the centroids to look like actual patterns in the training set (thus, by definition, valid patterns). However, an individual pattern is often noisy or idiosyncratic, and a more typical and still valid pattern should smooth out noise and idiosyncrasies—just as the idea of an “everyman” includes features common to most men, but does not coincide with any actual man. Thus, best results are achieved with intermediate bandwidth values: neither too large that they average widely different patterns, not too small that they average a single pattern, but just small enough that they average a local subset of patterns—where the average is weighted, as given by eq. (3) but using points from a single cluster. Then, the bandwidth can be seen as a smoothing parameter that controls the representativeness of the centroids. Crucially, this role is separate from that of , which sets the number of clusters, while in mean-shift both roles are conflated, since the bandwidth determines both the smoothing and the number of clusters.
How to determine the best bandwidth value? Intuitively, one would expect that bandwidth values that produce good densities should also give reasonable results with -modes. Indeed, this was the case in our experiments using a simple bandwidth estimation rule (the average distance to the th nearest neighbor). In general, what “representative” means depends on the application, and -modes offers potential as an exploratory data analysis tool. By running the homotopy algorithm from large bandwidths to small bandwidths (where “small” can be taken as, say, one tenth of the result from a bandwidth estimator), the algorithm conveniently presents to the user a sequence of centroids spanning the smoothing spectrum. As mentioned before, the computational cost of this is comparable to that of running -means multiple times to achieve a good optimum in the first place. Finally, in other applications, one may want to use -modes as a post-processing of the -means centroids to make them more representative.
Our -modes algorithm allows the user to work with a kernel density estimate of bandwidth (like mean-shift clustering) but produce exactly clusters (like -means). It finds centroids that are valid patterns and lie in high-density areas (unlike -means), are representative of their cluster and neighborhood, yet they average out noise or idiosyncrasies that exist in individual data points. Computationally, it is slightly slower than -means but far faster than mean-shift. Theory and experiments suggest that it may also be more robust to outliers and parameter misspecification than -means and mean-shift.
Our -modes algorithm can use a local bandwidth at each point rather than a global one, and non-gaussian kernels, in particular finite-support kernels (such as the Epanechnikov kernel) may lead to a faster algorithm. We are also working on -modes formulations where the assignment variables are relaxed to be continuous.
A main application for -modes is in clustering problems where the centroids must be interpretable as valid patterns. Beyond clustering, -modes may also find application in problems where the data fall in a nonconvex low-dimensional manifold, as in finding landmarks for dimensionality reduction methods de Silva and Tenenbaum (2003)
, where the landmarks should lie on the data manifold; or in spectral clustering
Ng et al. (2002), where the projection of the data on the eigenspace of the graph Laplacian defines a hypersphere.
Work funded in part by NSF CAREER award IIS–0754089.
Mode-finding for mixtures of Gaussian distributions.
IEEE Trans. Pattern Analysis and Machine Intelligence, 22(11):1318–1323, Nov. 2000.Proc. of the 2006 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’06)
, pages 1160–1167, New York, NY, June 17–22 2006.A best possible heuristic for the
-center problem. Math. Oper. Res., 10(2):180–184, May 1985.Finding Groups in Data: An Introduction to Cluster Analysis
.Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, 1990.