ExKMC: Expanding Explainable k-Means Clustering

06/03/2020 ∙ by Nave Frost, et al. ∙ Tel Aviv University ∙ University of California, San Diego

Despite the popularity of explainable AI, there is limited work on effective methods for unsupervised learning. We study algorithms for k-means clustering, focusing on a trade-off between explainability and accuracy. Following prior work, we use a small decision tree to partition a dataset into k clusters. This enables us to explain each cluster assignment by a short sequence of single-feature thresholds. While larger trees produce more accurate clusterings, they also require more complex explanations. To allow flexibility, we develop a new explainable k-means clustering algorithm, ExKMC, that takes an additional parameter k' ≥ k and outputs a decision tree with k' leaves. We use a new surrogate cost to efficiently expand the tree and to label the leaves with one of k clusters. We prove that as k' increases, the surrogate cost is non-increasing, and hence, we trade explainability for accuracy. Empirically, we validate that ExKMC produces a low cost clustering, outperforming both standard decision tree methods and other algorithms for explainable clustering. Implementation of ExKMC available at https://github.com/navefr/ExKMC.




1 Introduction

The bulk of research on explainable machine learning studies how to interpret the decisions of supervised learning methods, largely focusing on feature importance in black-box models [3, 25, 42, 46, 50, 51, 56, 57]. To complement these efforts, we study explainable algorithms for clustering, a canonical example of unsupervised learning. Most clustering algorithms operate iteratively, using global properties of the data to converge to a low-cost solution. For center-based clustering, the best explanation for a cluster assignment may simply be that a data point is closer to some center than any others. While this type of explanation provides some insight into the resulting clusters, it obscures the impact of individual features, and the cluster assignments often depend on the dataset in a complicated way.

Recent work on explainable clustering goes one step further by enforcing that the clustering be derived from a binary threshold tree [11, 17, 21, 28, 31, 43]. Each node is associated with a feature-threshold pair that recursively splits the dataset, and labels on the leaves correspond to clusters. Any cluster assignment can be explained by a small number of thresholds, each depending on a single feature. For large, high-dimensional datasets, this provides more information than typical clustering methods.

To make our study concrete, we focus on the k-means objective. The goal is to find $k$ centers that approximately minimize the sum of the squared distances between the data points in a dataset $\mathcal{X}$ and their nearest center [1, 2, 4, 22, 37, 52]. In this context, Dasgupta, Frost, Moshkovitz, and Rashtchian have studied the use of a small threshold tree to specify the cluster assignments, exhibiting the first explainable k-means clustering algorithm with provable guarantees [21]. They propose the Iterative Mistake Minimization (IMM) algorithm and prove that it achieves a worst-case $O(k^2)$ approximation to the optimal k-means clustering cost. However, the IMM algorithm and analysis are limited to trees with exactly $k$ leaves (the same as the number of clusters). They also prove a lower bound showing that an $\Omega(\log k)$ approximation is the best possible when restricted to trees with at most $k$ leaves.

Our goal is to improve upon two shortcomings of previous work by (i) providing an experimental evaluation of their algorithms and (ii) exploring the impact of using more leaves in the threshold tree. We posit that on real datasets it should be possible to find a nearly-optimal clustering of the dataset. In other words, the existing worst-case analysis may be very pessimistic. Furthermore, we hypothesize that increasing the tree size should lead to monotonically decreasing the clustering cost.

Figure 1: Tree size (explanation complexity) vs. k-means clustering quality. Panels: (a) k-means++ (reference); (b) $k$ leaves (IMM); (c) more than $k$ leaves (ExKMC); (d) arbitrarily many leaves; (e) IMM tree; (f) ExKMC tree.

As our main contribution, we propose a novel extension of the previous explainable k-means algorithm. Our method, ExKMC, takes as input two parameters $k$ and $k'$ and a dataset $\mathcal{X}$, with $k' \ge k$. It first builds a threshold tree with $k$ leaves using the IMM algorithm [21]. Then, given a budget of $k'$ leaves, it greedily expands the tree to reduce the clustering cost. At each step, the clusters form a refinement of the previous clustering. By adding more thresholds, we gain more flexibility in the data partition, and we also allow multiple leaves to correspond to the same cluster (with $k$ clusters total).

To efficiently determine the assignment of leaves to clusters, we design and analyze a surrogate cost. Recall that the IMM algorithm first runs a standard k-means algorithm, producing a set of $k$ centers that are given as an additional input [21]. As ExKMC expands the tree to $k'$ leaves, it minimizes the cost of the current clustering compared to these $k$ reference centers. We assign each leaf a label based on the best reference center, which determines the cluster assignments. The main idea is that by fixing the centers between steps, we can more efficiently determine the next feature-threshold pair to add. We prove that the surrogate cost is non-increasing throughout the execution as the number of leaves grows. Once the tree refines the reference clustering, the k-means cost matches that of the reference clustering.

Figure 1 depicts the improvement from using more leaves. The left picture shows a near-optimal k-means clustering. Next, the IMM algorithm with $k$ leaves leads to a large deviation from the reference clustering. Extending the tree to use more leaves with ExKMC leads to a lower-cost result that better approximates the reference clustering. We form three clusters by subdividing the previous clusters and mapping multiple leaves to the same cluster. Finally, trees with an arbitrary number of leaves can perfectly fit the reference clustering. Figures 1(e) and 1(f) contain the trees for Figures 1(b) and 1(c), respectively.

To test our new algorithm, we provide an extensive evaluation on many real and synthetic datasets. We show that ExKMC often has lower k-means cost than several baselines. We also find that the prior IMM analysis is pessimistic because the algorithm achieves clustering cost within 5–30% of the cost of standard k-means algorithms. Next, we explore the effect of using more than $k$ leaves. On eleven datasets, we show that threshold trees with between $2k$ and $4k$ leaves actually suffice to get within 1–2% of the cost of a standard k-means implementation. The only outlier is CIFAR-10, where we conjecture that pixels are insufficient features to capture the clustering. Overall, we verify that it is possible to find an explainable clustering with high accuracy, while using only $O(k)$ leaves for k-means clustering.

1.1 Related Work

We address the challenge of obtaining a low cost k-means clustering using a small decision tree. Our approach has roots in previous works on clustering with unsupervised decision trees [7, 11, 16, 21, 23, 28, 43, 61] and in prior literature on extending decision trees for tasks beyond classification [29, 30, 35, 45, 60, 54, 55].

Besides the IMM algorithm [21], prior explainable clustering algorithms optimize different objectives than the k-means cost, such as the Silhouette metric [11], density measures [43], or interpretability scores [58]. Most similar to our approach, a localized version of the 1-means cost has been used for greedily splitting nodes when growing the tree [28, 31]. We compare ExKMC against two existing methods: CUBT [28] and CLTree [43]. We also compare against KDTree [9] and, after generating the cluster labels, the standard decision tree classification method CART [15].

Clustering via trees is explainable by design. We contrast this with the indirect approach of clustering with a neural network and then explaining the network [38]. A generalization of tree-based clustering has been studied using rectangle-based mixture models [17, 18, 53]. Their focus differs from ours as they consider including external information, such as prior knowledge, via a graphical model and performing inference with variational methods. Clustering after feature selection [13, 20] or feature extraction [8, 14, 48] reduces the number of features, but it does not lead to an explainable solution because it requires running a non-explainable k-means algorithm on the reduced feature space.

1.2 Our contributions

We present a new algorithm, ExKMC, for explainable k-means clustering with the following properties:

Explainability-accuracy trade-off.  We provide a simple method to expand a threshold tree with $k$ leaves into a tree with a specified number of leaves. At each step, we aim to better approximate a given reference clustering (such as from a standard k-means implementation). The key idea is to minimize a surrogate cost that is based on the reference centers instead of using the k-means cost directly. By doing so, we efficiently expand the tree while obtaining a good k-means clustering.

Convergence.  We demonstrate empirically that ExKMC quickly converges to the reference clustering as the number of leaves increases. On many datasets, the cost ratio between our clustering and the reference clustering goes to 1.0 as the number of leaves goes from $k$ to $4k$, where $k$ is the number of labels for classification datasets. In theory, we prove that the surrogate cost is non-increasing throughout the execution of ExKMC, verifying that we trade explainability for clustering accuracy.

Low cost.  On a dozen datasets, our algorithm often achieves a lower k-means cost for a given number of leaves compared to four baselines: CUBT [28], CLTree [43], KDTree [9], and CART [44].

Speed.  Using only standard optimizations (e.g., dynamic programming), ExKMC can cluster fairly large datasets (e.g., CIFAR-10 or covtype) in under 15 minutes using a single processor, making it a suitable alternative to standard k-means in data science pipelines. The main improvement comes from using the surrogate cost, instead of the k-means cost, to determine the cluster assignments.

2 Preliminaries

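We let $[n] = \{1, \ldots, n\}$. For $k \ge 1$, a $k$-clustering refers to a partition of a dataset into $k$ clusters. Let $C^1, \ldots, C^k$ be a $k$-clustering of $\mathcal{X} \subseteq \mathbb{R}^d$ with $n = |\mathcal{X}|$ and means $\mu^j = \mathrm{mean}(C^j)$. The k-means cost is $\sum_{j=1}^{k} \sum_{x \in C^j} \|x - \mu^j\|_2^2$. It is NP-hard to find the optimal k-means clustering [2, 22] or a close approximation [5]. Standard algorithms for k-means are global and iterative, leading to complicated clusters that depend on the data in a hard to explain way [1, 4, 37, 52].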
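As a minimal illustration of the k-means cost just defined, the sketch below evaluates it for a tiny hand-made clustering in pure Python. The function names (`mean`, `sq_dist`, `kmeans_cost`) and the toy data are assumptions introduced here for illustration; they are not taken from the ExKMC code base.

```python
def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def sq_dist(x, c):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def kmeans_cost(clusters):
    """Sum over clusters of squared distances to that cluster's own mean."""
    total = 0.0
    for cluster in clusters:
        mu = mean(cluster)
        total += sum(sq_dist(x, mu) for x in cluster)
    return total


# Toy 2-clustering of four points in the plane (illustrative data only).
clusters = [
    [(0.0, 0.0), (2.0, 0.0)],
    [(10.0, 10.0), (10.0, 12.0)],
]
cost = kmeans_cost(clusters)
print(cost)
```

Each cluster here contributes 2.0 (two points at distance 1 from their mean), so the total cost is 4.0.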

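Explainable clustering.  Let $T$ be a binary threshold tree with $k' \ge k$ leaves, where each internal node contains a single feature $i \in [d]$ and threshold $\theta \in \mathbb{R}$. We also consider a labeling function $\ell : \mathrm{leaves}(T) \to [k]$ that maps the leaves of $T$ to clusters. The pair $(T, \ell)$ induces a $k$-clustering of $\mathcal{X}$ as follows. First, $\mathcal{X}$ is partitioned via $T$ using the feature-threshold pairs on the root-to-leaf paths. Then, each point is assigned to one of $k$ clusters according to how $\ell$ labels its leaf. Geometrically, the clusters reside in cells bounded by axis-aligned cuts, where the number of cells equals the number of leaves. This results in a $k$-clustering $\hat{C}^1, \ldots, \hat{C}^k$ with means $\hat{\mu}^j = \mathrm{mean}(\hat{C}^j)$, and we denote the k-means cost of the pair $(T, \ell)$ as $\mathrm{cost}(T, \ell) = \sum_{j=1}^{k} \sum_{x \in \hat{C}^j} \|x - \hat{\mu}^j\|_2^2$.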
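To make the pair of a threshold tree and a leaf labeling concrete, here is a small sketch of a tree with four leaves and three clusters, where two leaves share a label. The `Node` class and `explain` function are hypothetical names chosen for this example; the sketch only shows how a cluster assignment comes with a short list of single-feature thresholds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Internal nodes set feature/threshold/left/right; leaves set only label.
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[int] = None  # cluster index in {0, ..., k-1} for leaves


def explain(node, x):
    """Return (cluster label, list of threshold tests) for point x."""
    path = []
    while node.label is None:
        if x[node.feature] <= node.threshold:
            path.append(f"x[{node.feature}] <= {node.threshold}")
            node = node.left
        else:
            path.append(f"x[{node.feature}] > {node.threshold}")
            node = node.right
    return node.label, path


# Four leaves, three clusters: two different leaves are labeled 0.
tree = Node(
    feature=0, threshold=5.0,
    left=Node(label=0),
    right=Node(
        feature=1, threshold=2.0,
        left=Node(label=1),
        right=Node(feature=0, threshold=8.0,
                   left=Node(label=0), right=Node(label=2)),
    ),
)

label, path = explain(tree, (6.0, 3.0))
print(label, path)
```

For the point (6.0, 3.0) the explanation is three thresholds long and ends in a leaf labeled with cluster 0, the same cluster as the leaf reached by points with a first coordinate of at most 5.0.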

Iterative Mistake Minimization (IMM).  Previous work has exhibited the IMM algorithm that produces a threshold tree with $k$ leaves [21]. It first runs a standard k-means algorithm to find $k$ centers. Then, it iteratively finds the best feature-threshold pair to partition the data into two parts. At each step, the number of mistakes is minimized, where a mistake occurs if a data point is separated from its center. Each partition also enforces that at least one center ends up in each child, so that the tree terminates with exactly $k$ leaves. Each leaf contains one center at the end, and the clusters are assigned based on this center. The IMM algorithm provides an $O(k^2)$ approximation to the optimal k-means cost, assuming that a constant-factor approximation algorithm generates the initial centers.
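The notion of a mistake can be sketched for a single split as follows. This is an illustrative simplification, not the IMM implementation from [21]: the function names are hypothetical, and the choice of candidate thresholds (data coordinates lying between the smallest and largest center coordinate of a feature) is an assumption made here so that every candidate leaves at least one center in each child.

```python
def count_mistakes(points, assigned, centers, feature, threshold):
    """Number of points that fall on the other side of the cut than their center."""
    return sum(
        (x[feature] <= threshold) != (centers[a][feature] <= threshold)
        for x, a in zip(points, assigned)
    )


def best_imm_split(points, assigned, centers):
    """Return (mistakes, feature, threshold) minimizing mistakes, or None."""
    best = None
    for feature in range(len(centers[0])):
        lo = min(c[feature] for c in centers)
        hi = max(c[feature] for c in centers)
        if lo == hi:
            continue  # this feature cannot separate any pair of centers
        # Any threshold in [lo, hi) keeps a center on both sides of the cut.
        candidates = sorted({lo} | {x[feature] for x in points
                                    if lo <= x[feature] < hi})
        for threshold in candidates:
            m = count_mistakes(points, assigned, centers, feature, threshold)
            if best is None or m < best[0]:
                best = (m, feature, threshold)
    return best


# Toy data: two centers, each point assigned to its nearest center.
centers = [(0.0, 0.0), (10.0, 0.0)]
points = [(1.0, 0.0), (2.0, 1.0), (6.0, 0.0), (9.0, 1.0)]
assigned = [0, 0, 1, 1]
print(best_imm_split(points, assigned, centers))
```

On this toy input the cut `x[0] <= 2.0` separates the two centers without separating any point from its center, so it has zero mistakes.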

We use the IMM algorithm to build a tree with $k$ leaves as the initialization for our algorithm. Then, we smoothly trade explainability for accuracy by expanding the tree to have $k'$ leaves, for a user-specified parameter $k' \ge k$. More formally, we develop an algorithm to solve the following problem:

Problem Formulation.  Given a dataset $\mathcal{X} \subseteq \mathbb{R}^d$ and parameters $k, k'$ with $k' \ge k$, the goal is to efficiently construct a binary threshold tree $T$ with $k'$ leaves and a function $\ell : \mathrm{leaves}(T) \to [k]$ such that $(T, \ell)$ induces a $k$-clustering of $\mathcal{X}$ with as small k-means cost as possible.

3 Our Algorithm

We describe our explainable clustering algorithm, ExKMC, that efficiently finds a tree-based $k$-clustering of a dataset. Starting with a base tree (either empty or from an existing algorithm like IMM), ExKMC expands the tree by replacing a leaf node with two new children. In this way, it refines the clustering, while allowing the new children to be mapped to different clusters. A key optimization is to use a new surrogate cost to determine both the best threshold cut and the labeling of the leaves. At the beginning, we run a standard k-means algorithm and generate $k$ reference centers. Then, the surrogate cost is the k-means cost if the centers were the reference centers. By fixing the centers, instead of changing them at every step, we determine the cluster label for each leaf independently (via the best reference center). For a parameter $k'$, our algorithm terminates when the tree has $k'$ leaves.

3.1 Surrogate cost

Our starting point is a set of $k$ reference centers $\mu^1, \ldots, \mu^k$, obtained from a standard k-means algorithm. This induces a clustering with low k-means cost. While it is possible to calculate the actual k-means cost as we expand the tree, it is difficult and time-consuming to recalculate the distances to a dynamic set of centers. Instead, we fix the reference centers, and we define the surrogate cost as the sum of squared distances between the points in each leaf and the best reference center for that leaf.

Definition 1 (Surrogate cost).

Given centers $\mu^1, \ldots, \mu^k$ and a threshold tree $T$ that defines the clustering $(\hat{C}^1, \ldots, \hat{C}^{k'})$, the surrogate cost is defined as

$$\widetilde{\mathrm{cost}}^{\mu^1, \ldots, \mu^k}(T) = \sum_{j=1}^{k'} \min_{i \in [k]} \sum_{x \in \hat{C}^j} \|x - \mu^i\|_2^2.$$

The difference between the new surrogate cost and the k-means cost is that the centers are fixed. In contrast, the optimal k-means centers are the means of the clusters, and therefore, they would change throughout the execution of the algorithm. Before we present our algorithm in detail, we mention that at each step the goal will be to expand the tree by splitting an existing leaf into two children. To see the benefit of the surrogate cost, consider choosing a split at each step that minimizes the actual k-means cost of the new clustering. This requires time at least linear in the size of the whole dataset for each possible split because we must iterate over the entire dataset to calculate the new cost. In Section 3.2, we provide the detailed argument showing that the time ExKMC takes at each step depends only on the points in the node being split, which is an improvement as the number of surviving points in a node often decreases rapidly as the tree grows.

The surrogate cost has several further benefits. The best reference center for a cluster is independent of the other cluster assignments. This independence makes the calculation of the surrogate cost more efficient, as there is no need to recalculate the entire cost if some of the points are added or removed from a cluster. The surrogate cost is also an upper bound on the k-means cost. Indeed, it is known that using the mean of a cluster as its center can only improve the k-means cost. In Section 3.3, we provide theoretical guarantees, such as showing that our algorithm always converges to the reference clustering as the number of leaves increases. Finally, in Section 4, we empirically show that minimizing the surrogate cost still leads to a low-cost clustering, nearly matching the reference cost. We now utilize this idea to design a new algorithm for explainable clustering.
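The relationship between the two costs can be checked on a toy example. The sketch below computes the surrogate cost of a set of leaves against fixed reference centers, labels each leaf with its best reference center, and compares the result with the actual k-means cost of the induced clusters. All function names and the data are assumptions made for illustration.

```python
def sq_dist(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))


def mean(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def best_label(leaf, centers):
    """Index of the reference center with the lowest total cost for this leaf."""
    return min(range(len(centers)),
               key=lambda j: sum(sq_dist(x, centers[j]) for x in leaf))


def surrogate_cost(leaves, centers):
    """Each leaf pays the squared distances to its single best fixed center."""
    return sum(
        sum(sq_dist(x, centers[best_label(leaf, centers)]) for x in leaf)
        for leaf in leaves
    )


def actual_cost(leaves, centers):
    """k-means cost after merging leaves with equal labels and re-centering."""
    clusters = {}
    for leaf in leaves:
        clusters.setdefault(best_label(leaf, centers), []).extend(leaf)
    return sum(sum(sq_dist(x, mean(c)) for x in c) for c in clusters.values())


centers = [(0.0, 0.0), (10.0, 0.0)]        # fixed reference centers
leaves = [
    [(0.0, 0.0), (2.0, 0.0)],              # best center: index 0
    [(9.0, 0.0), (11.0, 0.0)],             # best center: index 1
    [(4.0, 0.0)],                          # best center: index 0
]
print(surrogate_cost(leaves, centers), actual_cost(leaves, centers))
```

Here the surrogate cost is 22.0 while the actual k-means cost of the two induced clusters is 10.0, consistent with the surrogate cost being an upper bound.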

Algorithm 1 describes the ExKMC algorithm, which uses subroutines in Algorithm 2. It takes as input a value $k$, a dataset $\mathcal{X}$, and a number of leaves $k' \ge k$. The first step will be to generate $k$ reference centers from a standard k-means implementation and to build a threshold tree $T$ with $k$ leaves (for evaluation, $T$ is the output of the IMM algorithm). For simplicity, we refer to these as inputs as well. ExKMC outputs a tree $T'$ and a labeling $\ell$ that assigns leaves to clusters. Notably, the clustering induced by $(T', \ell)$ always refines the one from $T$.

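Algorithm 1 ExKMC: Expanding Explainable k-Means Clustering

Input:
    $\mathcal{X}$ – Set of vectors in $\mathbb{R}^d$
    $\mathcal{M}$ – Set of $k$ reference centers
    $T$ – Base tree
    $k'$ – Number of leaves
Output: Labeled tree with $k'$ leaves

1 foreach leaf of $T$ do
2     add_gain(leaf)
3 while $T$ has fewer than $k'$ leaves do
4     split the leaf with the largest gain, label its two children with find_labels, and call add_gain on each child
  return $T$

Algorithm 2 Subroutines (including add_gain and find_labels)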

Initialization.  In line 1, we first compute and store the gain of each leaf of the base tree $T$. The gain of a split is the reduction in surrogate cost that it achieves, that is, the cost without the split minus the cost with it. The details are in the subroutine add_gain. It stores the best feature-threshold pair and cost improvement in splits and gains, respectively.

Growing the tree.  In the while loop, we expand the tree by splitting the node with the largest gain. We use the best reference center for each of its two children, using the subroutine find_labels. We then update the lists splits and gains for the new children. At the final step, we create a tree with labeled leaves, where the implicit labeling function maps a leaf to the lowest cost reference center.
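The greedy expansion described above can be sketched in a few lines. This is a deliberately simplified illustration and not the authors' implementation: it represents each leaf only by its list of points, finds splits by brute force, and recomputes every leaf's best split in each round instead of caching them in splits and gains. The names `leaf_cost`, `best_split`, and `expand` are hypothetical.

```python
def sq_dist(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))


def leaf_cost(points, centers):
    """Surrogate cost of one leaf: cost under its single best reference center."""
    return min(sum(sq_dist(x, c) for x in points) for c in centers)


def best_split(points, centers):
    """Brute force: (gain, feature, threshold, left, right), or None if no split."""
    base = leaf_cost(points, centers)
    best = None
    for feature in range(len(points[0])):
        # Dropping the largest value keeps the right child non-empty.
        for threshold in sorted({x[feature] for x in points})[:-1]:
            left = [x for x in points if x[feature] <= threshold]
            right = [x for x in points if x[feature] > threshold]
            gain = base - leaf_cost(left, centers) - leaf_cost(right, centers)
            if best is None or gain > best[0]:
                best = (gain, feature, threshold, left, right)
    return best


def expand(leaves, centers, max_leaves):
    """Repeatedly split the leaf with the largest gain until max_leaves is reached."""
    leaves = [list(leaf) for leaf in leaves]
    history = [sum(leaf_cost(leaf, centers) for leaf in leaves)]
    cuts = []
    while len(leaves) < max_leaves:
        options = [(best_split(leaf, centers), i) for i, leaf in enumerate(leaves)]
        options = [(s, i) for s, i in options if s is not None]
        if not options:
            break  # no leaf can be split any further
        (gain, feature, threshold, left, right), i = max(
            options, key=lambda o: o[0][0])
        leaves[i:i + 1] = [left, right]
        cuts.append((feature, threshold))
        history.append(history[-1] - gain)
    return leaves, cuts, history


# Start from an empty tree (one leaf holding all points) and grow to 3 leaves.
centers = [(1.0, 0.0), (10.0, 0.0)]
points = [(0.0, 0.0), (2.0, 0.0), (9.0, 0.0), (11.0, 0.0)]
leaves, cuts, history = expand([points], centers, max_leaves=3)
print(cuts, history)
```

In this run the first cut `x[0] <= 2.0` lowers the surrogate cost from 166.0 to 4.0, and the second cut has zero gain, so the recorded costs never increase.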

3.2 Speeding up ExKMC

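The running time is dominated by finding the best split at each step. Fortunately, this can be executed in time that is nearly linear in $n_u$, the number of points surviving at node $u$. The logarithmic factor comes from sorting the points in each coordinate. Then, we go over all splits $(i, \theta)$ and find the one that minimizes $\sum_{x \in C_\ell} \|x - \mu^a\|_2^2 + \sum_{x \in C_r} \|x - \mu^b\|_2^2$, where $C_\ell$ contains all points in $u$ where $x_i \le \theta$ and $\mu^a$ is the best center for $C_\ell$ among the $k$ reference centers (cluster $C_r$ and center $\mu^b$ are defined similarly for points with $x_i > \theta$). At first glance, it seems that the running time is the product of the number of possible splits, the number of possible values of $a$ (and $b$), and the time to calculate the cost of each split and center value. To improve the running time we can rewrite the cost as

$$\sum_{x \in C_\ell} \|x - \mu^a\|_2^2 + \sum_{x \in C_r} \|x - \mu^b\|_2^2 = \sum_{x \in C_\ell \cup C_r} \|x\|_2^2 - 2 \Big\langle \sum_{x \in C_\ell} x, \mu^a \Big\rangle + |C_\ell| \, \|\mu^a\|_2^2 - 2 \Big\langle \sum_{x \in C_r} x, \mu^b \Big\rangle + |C_r| \, \|\mu^b\|_2^2.$$

Since we care about minimizing the last expression, we can ignore the term $\sum_{x} \|x\|_2^2$ as it is independent of the identity of the split. Using dynamic programming, we go over all splits and save $\sum_{x \in C_\ell} x$ and $\sum_{x \in C_r} x$. An update then no longer requires iterating over the points in the node, which reduces the total time. When the dimensionality is large, we can use an additional improvement by saving the inner products $\langle x, \mu^j \rangle$ for each point $x$ and center $\mu^j$, reducing the total running time further.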
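A sketch of the running-sum idea: for each feature, sort the points once, sweep the threshold from left to right while maintaining the vector sums of the two sides, and score each side against every reference center using only its sum and its size. The function names are hypothetical and the code is a plain-Python illustration, assuming each split evaluation costs time proportional to the dimension times the number of centers; it omits the additional inner-product caching mentioned above.

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def side_cost(vec_sum, count, centers, norms):
    """min over centers of -2<sum x, mu> + count * ||mu||^2 (||x||^2 term dropped)."""
    return min(-2.0 * dot(vec_sum, c) + count * nc
               for c, nc in zip(centers, norms))


def best_split(points, centers):
    """Return (feature, threshold, surrogate cost after the split), or None."""
    d = len(points[0])
    norms = [dot(c, c) for c in centers]
    const = sum(dot(x, x) for x in points)      # independent of the split
    total = [sum(x[i] for x in points) for i in range(d)]
    best = None
    for feature in range(d):
        order = sorted(points, key=lambda x: x[feature])
        left = [0.0] * d
        right = list(total)
        for t in range(len(order) - 1):
            x = order[t]
            for i in range(d):                  # move one point to the left side
                left[i] += x[i]
                right[i] -= x[i]
            if order[t][feature] == order[t + 1][feature]:
                continue                        # equal values cannot be separated
            cost = (side_cost(left, t + 1, centers, norms)
                    + side_cost(right, len(order) - t - 1, centers, norms))
            if best is None or cost < best[0]:
                best = (cost, feature, order[t][feature])
    if best is None:
        return None
    cost, feature, threshold = best
    return feature, threshold, cost + const     # add the constant back


centers = [(1.0, 0.0), (10.0, 0.0)]
points = [(0.0, 0.0), (2.0, 0.0), (9.0, 0.0), (11.0, 0.0)]
print(best_split(points, centers))
```

On the toy input the sweep selects the cut `x[0] <= 2.0`, and adding back the dropped constant recovers a surrogate cost of 4.0 for the two children.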

3.3 Theoretical guarantees

Now that we have defined our ExKMC algorithm, we provide some guarantees on its performance (we defer all proofs for this section to Appendix A). We first prove two theorems about the surrogate cost, showing that it satisfies the two desirable properties of being non-increasing and also eventually converging to the reference clustering. Next, we verify that ExKMC has a worst-case approximation ratio of $O(k^2)$ compared to the optimal k-means clustering when using IMM to build the base tree. Finally, we provide a separation between IMM and ExKMC for a difficult dataset.

Theorem 1.

The surrogate cost, $\widetilde{\mathrm{cost}}$, is non-increasing throughout the execution of ExKMC.

To prove the theorem, notice that when the algorithm performs a split at node $u$, the cost of all points not in $u$ remains intact. Additionally, the algorithm can assign the two children of $u$ the same label as in $u$. This choice does not change the cost of the points in $u$, so choosing the best split and labels can only lower the cost. The full proof appears in the appendix. Eventually ExKMC converges to a low-cost tree, as the next theorem proves. Specifically, after enough steps in the worst case, we always get a refinement; see Corollary 2 in Appendix A.

Theorem 2.

Let $C$ be a reference clustering. If while running ExKMC the threshold tree $T$ refines $C$, then the clustering induced by $(T, \ell)$ equals $C$, where $\ell$ is the surrogate cost labeling.

We also provide a worst-case guarantee compared to the optimal k-means solution. We show that the ExKMC algorithm using the IMM base tree provably achieves an $O(k^2)$ approximation to the k-means cost. The proof uses Theorem 1 combined with the previous analysis of the IMM algorithm [21]; see Appendix A.

Theorem 3.

Algorithm 1 with the IMM base tree is an $O(k^2)$ approximation to the optimal k-means cost.

3.4 Example where ExKMC provably improves upon IMM

Prior work designed a dataset and proved that a tree with $k$ leaves incurs an $\Omega(\log k)$ approximation on this dataset [21], which we call Synthetic II. It contains $k$ clusters that globally are very far from each other, but locally, clusters look the same.

Synthetic II dataset.

This dataset consists of $k$ clusters where any two clusters are very far from each other while inside any cluster the points differ by at most two features. The dataset is created by first taking $k$ random binary codewords $v^1, \ldots, v^k$. The distance between any two codewords is at least $\alpha d$, where $\alpha$ is some constant. For $i \in [k]$, we construct cluster $C_{v^i}$ by taking the codeword $v^i$ and then changing one feature at a time to be equal to $0$. The size of each cluster is $d$. There are in total $dk$ points in the dataset $\mathcal{X}$. The optimal $k$ centers for this dataset are the codewords. We take for some universal constant , and we assume that is sufficiently large.
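The construction can be sketched as a small generator. Two points are assumptions made here rather than facts from the text: codeword entries are drawn from {-1, +1} (so that setting a coordinate to zero always changes the point), and the minimum-distance requirement between codewords is not enforced, since random codewords in high dimension are far apart with high probability. The function name `synthetic_ii` is hypothetical.

```python
import random


def synthetic_ii(k, d, seed=0):
    """Return (codewords, points, labels) for k clusters of d points each."""
    rng = random.Random(seed)
    # Assumption: entries in {-1, +1}; the text only says "binary codewords".
    codewords = [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(k)]
    points, labels = [], []
    for i, v in enumerate(codewords):
        for j in range(d):
            x = list(v)
            x[j] = 0          # change one feature at a time to zero
            points.append(x)
            labels.append(i)
    return codewords, points, labels


codewords, points, labels = synthetic_ii(k=3, d=8)
print(len(points))
```

With `k=3` and `d=8` this yields 24 points, and any two points in the same cluster differ in exactly two features.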

Our approach to explaining this dataset consists of a few steps: (i) using k-means++ to generate the reference centers, (ii) building the base tree with IMM, and (iii) expanding the tree with ExKMC. We prove that this leads to an optimal tree-based clustering once the tree has enough leaves. In fact, the cost ratio decreases linearly as the number of leaves increases, as the next theorem proves. We experimentally confirm this linear decrease in the next section.

Theorem 4.

Let $\mathcal{X}$ be the dataset described above. There is a constant such that with probability at least over the randomness in k-means++, the output of ExKMC with centers from k-means++, using the base tree from IMM and desired centers, has approximation ratio at most compared to the optimal k-means clustering of $\mathcal{X}$.

We show that initially the IMM algorithm has an approximation ratio of $O(\log k)$. Interestingly, this is also the lower bound for trees with exactly $k$ leaves, and hence, it constructs the best possible tree up to a constant factor. The theorem also implies that after enough steps we reach the optimal clustering. We state this observation as the following corollary.

Corollary 1.

With probability at least over the randomness in -means++, the output of ExKMC, with centers from -means++ and base tree of IMM, after steps of the algorithm, it reaches the optimal -means clustering of .

The full proof of Theorem 4 is in the appendix. The main steps of the proof are the following. We first prove that with high probability the output of the k-means++ algorithm is actually the optimal set of centers, that is, the codewords in the dataset construction. Intuitively, this holds because globally the optimal centers and clusters are very far from each other. The second step is to define points that are mistakes, which are points in a leaf $u$ with the property that their optimal center is different from the label of $u$. To guarantee that the algorithm always makes progress, we prove that there is always a split that reduces the number of mistakes. Lastly, we bound from below the amount by which each such split reduces the cost. This completes the proof.

4 Empirical Evaluation

Algorithms.  We compare the following clustering methods (see Appendix B for details):

  • Reference Clustering. sklearn KMeans, 10 random initializations, 300 iterations.

  • CART. Standard decision tree from sklearn, minimizing gini impurity. Points in the dataset are assigned labels using the reference clustering.

  • KDTree. Split highest variance feature at median threshold. Size determined by leaf_size parameter. Label leaves to minimize the surrogate cost w.r.t. centers of the reference clustering.

  • CLTree. Explainable clustering method. Public implementation [19, 43].

  • CUBT. Explainable clustering method. Public implementation [32, 28].

  • ExKMC. Our method (Algorithm 1) starting with an empty tree; minimizes the surrogate cost at each split w.r.t. centers of the reference clustering.

  • ExKMC (base: IMM). Our method (Algorithm 1) starting with an IMM tree with $k$ leaves; then, minimizes the surrogate cost at each split w.r.t. centers of the reference clustering.

Set-up.  We use 10 real and 2 synthetic datasets; details in Appendix B.2. The number of clusters $k$ and the number of leaves $k'$ are inputs. We start with $k$ equal to the number of labels for classification datasets. We plot the cost ratio compared to the reference clustering (the best possible ratio is 1.0). For the baselines, we do hyperparameter tuning and choose the lowest cost clustering at each $k'$. CUBT and CLTree could only be feasibly executed on six small real datasets (we restrict computation time to one hour).

Figure 2: Each graph plots the ratio ($y$-axis) of the tree-based clustering cost to the near-optimal k-means clustering as a function of the number of leaves ($x$-axis). Lower is better, with the best possible ratio being 1.0. Our algorithm (black line) consistently performs well. See Figure 9 in Appendix B.4 for error bars. Panels: (a) Iris; (b) Wine; (c) Breast Cancer; (d) Digits; (e) Mice Protein; (f) Anuran; (g) Avila; (h) Covtype; (i) 20newsgroups; (j) CIFAR-10; (k) Synthetic I; (l) Synthetic II.

4.1 Experimental Results

Real datasets.  CLTree performs the worst in most cases, and the cost is often much larger than the other methods. Turning to CUBT, we see that on most datasets it is competitive (but often not the best). However, on Digits and Mice Protein, CUBT fails to converge to a good clustering. We see that CART performs well on many of the datasets, as expected. On Avila and 20newsgroups, CART has a fairly high cost. For the small datasets, KDTree performs competitively, but on large datasets, the cost remains high. On all of the datasets, our method ExKMC performs well. It often has the lowest cost throughout, except for CIFAR-10, where it requires more leaves to be competitive. When the number of leaves is exactly $k$, we can also see the performance of the IMM algorithm, where we see that its cost is quite low, in contrast to the previous theoretical analysis [21]. As usual, the outlier is CIFAR-10, where the cost is very high at $k$ leaves, and then quickly decreases as ExKMC expands the tree. We also evaluate ExKMC by starting with an empty tree (without using IMM at all). In general, the cost is worse or the same as ExKMC with IMM. We observe that the CLTree cost varies as a function of the number of leaves. The reason is that we perform a hyperparameter search for each instance separately, and perhaps surprisingly, the cost can sometimes increase. While we could have used the best tree with fewer leaves, this may be less representative of real usage. Indeed, the user would specify a desired number of leaves, and they may not realize that fewer leaves would lead to lower k-means cost.

Synthetic datasets.  We highlight two other aspects of explainable clustering. The Synthetic I dataset in Figure 2(k) is bad for CART. We adapt a dataset from prior work that forces CART to incur a large cost due to well-placed outliers [21]. CART has cost ratio above five, while other methods converge to a near-optimal cost. The Synthetic II dataset in Figure 2(l) is sampled from a hard distribution that demonstrates an $\Omega(\log k)$ lower bound for any explainable clustering with $k$ leaves [21]. The centers are random vectors, and the clusters contain the vectors where one of the center’s coordinates is set to zero. As expected, the IMM algorithm starts with a high k-means cost. When ExKMC expands the tree to have more leaves, the cost quickly goes down (proof in Appendix A.1).

Distance to the reference clustering.  In Appendix B.4, we report accuracy when points are labeled by cluster. Overall, we see the same trends as with cost. However, on some datasets, CART more closely matches the reference clustering than ExKMC, but the CART clustering has a higher cost. This highlights the importance of optimizing the k-means cost, not the classification accuracy.

Qualitative analysis.  Figure 3 depicts two example trees with four and eight leaves, respectively, on a subset of four clusters from the 20newsgroups dataset. The IMM base tree uses three features (words) to define four clusters. Then, ExKMC expands one of the leaves into a larger subtree, using seven total words to construct more nuanced clusters that better correlate with the newsgroup topics.

Running time.  Figure 4 shows the runtime of three methods on seven real datasets (commodity laptop, single process, i7 CPU @ 2.80GHz, 16GB RAM). Both IMM and ExKMC first run KMeans (from sklearn, 10 initializations, 300 iterations), and we report cumulative times. The explainable algorithms construct trees in under 15 minutes. On six datasets, they incur to overhead compared to standard KMeans. The 20newsgroups dataset has the largest overhead because sklearn optimizes for sparse vectors while IMM and ExKMC currently do not. In Appendix B.4, we also measure an improvement to ExKMC with feature-wise parallelization of the expansion step.

Surrogate vs. actual cost.  In Figure 5, we compare the surrogate cost (using the reference centers) with the actual k-means cost (using the means of the clusters as centers). In general, the two costs differ by at most 5-10%. On three datasets, they converge to match each other as the number of leaves grows. Covtype is an exception, where the two costs remain apart.

Figure 3: Explainability-accuracy trade-off using more leaves on a subset of four clusters from 20newsgroups. As ExKMC expands the IMM tree, it refines the clusters and reduces the k-means cost.
Figure 4: Runtime.

(a) Avila
(b) Covtype
(c) 20newsgroups
(d) CIFAR-10
Figure 5: Comparison of (dashed line) vs. (full line) of ExKMC with IMM base tree.

4.2 Discussion

Trade-off.  The main objective of our new algorithm is to provide a flexible trade-off between explainability and accuracy. Compared to the previous IMM algorithm, we see that using ExKMC to expand the threshold tree consistently leads to a lower cost clustering. We also see that using our surrogate cost improves the running time without sacrificing effectiveness.

Convergence.  On some datasets, IMM produces a fairly low k-means cost with $k$ leaves. Expanding the tree with ExKMC to between $2k$ and $4k$ leaves often results in nearly the same cost as the reference clustering. CIFAR-10 is an outlier, where none of the methods converge when using pixels as features (see also Appendix B.4). In practice, a tree with $4k$ leaves only slightly increases the explanation complexity compared to a tree with $k$ leaves (and $k$ leaves are necessary). ExKMC successfully achieves a good tree-based clustering with better interpretability than standard k-means methods.

Low cost.  The most striking part of the experiments is that tree-based clusterings can match the cost of standard clusterings. This is possible with a small tree, even on large, high-dimensional datasets. Prior to our work, this was not known to be the case. Therefore, ExKMC demonstrates that explainability can be obtained in conjunction with low cost on many datasets.

5 Conclusion

We exhibited a new algorithm, ExKMC, for generating an explainable k-means clustering using a threshold tree with a specified number of leaves. Theoretically, our algorithm has the property that as the number of leaves increases, the cost is non-increasing. This enables a deliberate trade-off between explainability and clustering cost. Extensive experiments showed that our algorithm often achieves lower k-means cost compared to several baselines. Moreover, we saw that ExKMC usually matches the reference clustering cost with only $4k$ leaves. Overall, ExKMC efficiently produces explainable clusters, and it could potentially replace standard k-means implementations in data science pipelines. For future work, it would be interesting to prove convergence guarantees for ExKMC either based on separation properties of real data or a Gaussian mixture model assumption. Another direction is to reduce the running time through better parallelism and sparse feature vector optimizations. Finally, our algorithm is feature-based, and therefore, if the features or data have biases, then these biases may also propagate and lead to biased clusters. It would be interesting future work to develop an explainable clustering algorithm that enforces fairness either in the feature selection process or in the composition of the clusters (for examples of such methods, see [6, 10, 34, 39, 47, 59]).


We thank Sanjoy Dasgupta for helpful discussions. We also thank Alyshia Olsen for help designing the figures. Nave Frost has been funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant agreement No. 804302). The contribution of Nave Frost is part of a Ph.D. thesis research conducted at Tel Aviv University.


  • [1] A. Aggarwal, A. Deshpande, and R. Kannan (2009) Adaptive sampling for k-means clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pp. 15–28. Cited by: §1, §2.
  • [2] D. Aloise, A. Deshpande, P. Hansen, and P. Popat (2009) NP-hardness of Euclidean sum-of-squares clustering. Machine learning 75 (2), pp. 245–248. Cited by: §1, §2.
  • [3] A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, et al. (2020) Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion 58, pp. 82–115. Cited by: §1.
  • [4] D. Arthur and S. Vassilvitskii (2007) K-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete algorithms, pp. 1027–1035. Cited by: §1, §2.
  • [5] P. Awasthi, M. Charikar, R. Krishnaswamy, and A. K. Sinop (2015) The hardness of approximation of Euclidean k-means. In 31st International Symposium on Computational Geometry, SoCG 2015, pp. 754–767. Cited by: §2.
  • [6] A. Backurs, P. Indyk, K. Onak, B. Schieber, A. Vakilian, and T. Wagner (2019) Scalable fair clustering. In International Conference on Machine Learning, pp. 405–413. Cited by: §5.
  • [7] J. Basak and R. Krishnapuram (2005) Interpretable hierarchical clustering by constructing an unsupervised decision tree. IEEE transactions on knowledge and data engineering 17 (1), pp. 121–132. Cited by: §1.1.
  • [8] L. Becchetti, M. Bury, V. Cohen-Addad, F. Grandoni, and C. Schwiegelshohn (2019) Oblivious dimension reduction for k-means: beyond subspaces and the Johnson-Lindenstrauss lemma. In STOC, Cited by: §1.1.
  • [9] J. L. Bentley (1975) Multidimensional binary search trees used for associative searching. Communications of the ACM 18 (9), pp. 509–517. Cited by: 2nd item, §1.1, §1.2.
  • [10] S. Bera, D. Chakrabarty, N. Flores, and M. Negahbani (2019) Fair algorithms for clustering. In Advances in Neural Information Processing Systems, pp. 4955–4966. Cited by: §5.
  • [11] D. Bertsimas, A. Orfanoudaki, and H. Wiberg (2018) Interpretable clustering via optimal trees. arXiv preprint arXiv:1812.00539. Cited by: §1.1, §1.1, §1.
  • [12] J. A. Blackard and D. J. Dean (1999) Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and electronics in agriculture 24 (3), pp. 131–151. Cited by: Table 1.
  • [13] C. Boutsidis, P. Drineas, and M. W. Mahoney (2009) Unsupervised feature selection for the k-means clustering problem. In NIPS, pp. 153–161. Cited by: §1.1.
  • [14] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas (2014) Randomized dimensionality reduction for k-means clustering. IEEE Transactions on Information Theory 61 (2), pp. 1045–1062. Cited by: §1.1.
  • [15] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen (1984) Classification and regression trees. CRC press. Cited by: 1st item, §1.1.
  • [16] J. Chang and D. Jin (2002) A new cell-based clustering method for large, high-dimensional data in data mining applications. In Proceedings of the 2002 ACM symposium on Applied computing, pp. 503–507. Cited by: §1.1.
  • [17] J. Chen, Y. Chang, B. Hobbs, P. Castaldi, M. Cho, E. Silverman, and J. Dy (2016) Interpretable clustering via discriminative rectangle mixture model. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 823–828. Cited by: §1.1, §1.
  • [18] J. Chen (2018) Interpretable clustering methods. Ph.D. Thesis, Northeastern University. Cited by: §1.1.
  • [19] D. Christodoulou Python-package for clustering via decision tree construction. Note: https://github.com/dimitrs/CLTree Cited by: 4th item, 4th item.
  • [20] M. B. Cohen, S. Elder, C. Musco, C. Musco, and M. Persu (2015) Dimensionality reduction for k-means clustering and low rank approximation. In STOC, Cited by: §1.1.
  • [21] S. Dasgupta, N. Frost, M. Moshkovitz, and C. Rashtchian (2020) Explainable k-means and k-medians clustering. arXiv preprint arXiv:2002.12538. Cited by: §A.1, §A.1, §A.1, §A.1, Appendix A, 6th item, §B.2, §1.1, §1.1, §1, §1, §1, §1, §2, §3.3, §3.4, §4.1, §4.1.
  • [22] S. Dasgupta (2008) The hardness of k-means clustering. In Technical Report, Cited by: §1, §2.
  • [23] L. De Raedt and H. Blockeel (1997) Using logical decision trees for clustering. In International Conference on Inductive Logic Programming, pp. 133–140. Cited by: §1.1.
  • [24] C. De Stefano, M. Maniaci, F. Fontanella, and A. S. di Freca (2018) Reliable writer identification in medieval manuscripts through page layout features: the “avila” bible case. Engineering Applications of Artificial Intelligence 72, pp. 99–110. Cited by: Table 1.
  • [25] D. Deutch and N. Frost (2019) Constraints-based explanations of classifications. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 530–541. Cited by: §1.
  • [26] D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: Table 1.
  • [27] R. A. Fisher (1936) The use of multiple measurements in taxonomic problems. Annals of eugenics 7 (2), pp. 179–188. Cited by: Table 1.
  • [28] R. Fraiman, B. Ghattas, and M. Svarc (2013) Interpretable clustering using unsupervised binary trees. Advances in Data Analysis and Classification 7 (2), pp. 125–145. Cited by: 3rd item, §1.1, §1.1, §1.2, §1, 5th item.
  • [29] P. Geurts and G. Louppe (2011) Learning to rank with extremely randomized trees. In JMLR: workshop and conference proceedings, Vol. 14, pp. 49–61. Cited by: §1.1.
  • [30] P. Geurts, N. Touleimat, M. Dutreix, and F. d’Alché-Buc (2007) Inferring biological networks with output kernel trees. BMC Bioinformatics 8 (2), pp. S4. Cited by: §1.1.
  • [31] B. Ghattas, P. Michel, and L. Boyer (2017) Clustering nominal data using unsupervised binary decision trees: comparisons with the state of the art methods. Pattern Recognition 67, pp. 177–185. Cited by: §1.1, §1.
  • [32] B. Ghattas R-package for interpretable clustering using unsupervised binary trees. Note: http://www.i2m.univ-amu.fr/perso/badih.ghattas/CUBT.html Cited by: 3rd item, 5th item.
  • [33] C. Higuera, K. J. Gardiner, and K. J. Cios (2015) Self-organizing feature maps identify proteins critical to learning in a mouse model of down syndrome. PloS one 10 (6). Cited by: Table 1.
  • [34] L. Huang, S. Jiang, and N. Vishnoi (2019) Coresets for clustering with fairness constraints. In Advances in Neural Information Processing Systems, pp. 7587–7598. Cited by: §5.
  • [35] Y. Jernite, A. Choromanska, and D. Sontag (2017) Simultaneous learning of trees and representations for extreme classification and density estimation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1665–1674. Cited by: §1.1.
  • [36] T. Joachims (1996) A probabilistic analysis of the rocchio algorithm with tfidf for text categorization.. Technical report Carnegie-mellon univ pittsburgh pa dept of computer science. Cited by: Table 1.
  • [37] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu (2002) A local search approximation algorithm for k-means clustering. In Proceedings of the Eighteenth Annual Symposium on Computational Geometry, pp. 10–18. Cited by: §1, §2.
  • [38] J. Kauffmann, M. Esders, G. Montavon, W. Samek, and K. Müller (2019) From clustering to cluster explanations via neural networks. arXiv preprint arXiv:1906.07633. Cited by: §1.1.
  • [39] M. Kleindessner, P. Awasthi, and J. Morgenstern (2019) Fair k-center clustering for data summarization. In International Conference on Machine Learning, pp. 3448–3457. Cited by: §5.
  • [40] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report. Cited by: Table 1.
  • [41] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: Table 1.
  • [42] Z. C. Lipton (2018) The mythos of model interpretability. Queue 16 (3), pp. 31–57. Cited by: §1.
  • [43] B. Liu, Y. Xia, and P. S. Yu (2005) Clustering via decision tree construction. In Foundations and Advances in Data Mining, pp. 97–124. Cited by: 4th item, §1.1, §1.1, §1.2, §1, 4th item.
  • [44] W. Loh (2011) Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1 (1), pp. 14–23. Cited by: §1.2.
  • [45] G. Louppe, L. Wehenkel, A. Sutera, and P. Geurts (2013) Understanding variable importances in forests of randomized trees. In Advances in neural information processing systems, pp. 431–439. Cited by: §1.1.
  • [46] S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774. Cited by: §1.
  • [47] S. Mahabadi and A. Vakilian (2020) (Individual) fairness for k-clustering. arXiv preprint arXiv:2002.06742. Cited by: §5.
  • [48] K. Makarychev, Y. Makarychev, and I. Razenshteyn (2019) Performance of Johnson-Lindenstrauss transform for k-means and k-medians clustering. In STOC, Cited by: §1.1.
  • [49] A. M. Mendoza-Henao, Á. M. Cortes-Gomez, M. A. Gonzalez, O. D. Hernandez-Córdoba, A. R. Acosta-Galvis, F. Castro-Herrera, J. M. Daza, J. M. Hoyos, M. P. Ramirez-Pinilla, N. Urbina-Cardona, et al. (2019) A morphological database for colombian anuran species from conservation-priority ecosystems. Ecology 100 (5), pp. e02685. Cited by: Table 1.
  • [50] C. Molnar (2019) Interpretable machine learning. Lulu. com. Note: https://christophm.github.io/interpretable-ml-book/ Cited by: §1.
  • [51] W. J. Murdoch, C. Singh, K. Kumbier, R. Abbasi-Asl, and B. Yu (2019) Interpretable machine learning: definitions, methods, and applications. arXiv preprint arXiv:1901.04592. Cited by: §1.
  • [52] R. Ostrovsky, Y. Rabani, L. J. Schulman, and C. Swamy (2013) The effectiveness of Lloyd-type methods for the k-means problem. Journal of the ACM 59 (6), pp. 1–22. Cited by: §1, §2.
  • [53] D. Pelleg and A. W. Moore (2001) Mixtures of rectangles: interpretable soft clustering. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pp. 401–408. Cited by: §1.1.
  • [54] K. Pliakos, P. Geurts, and C. Vens (2018) Global multi-output decision trees for interaction prediction. Machine Learning 107 (8-10), pp. 1257–1281. Cited by: §1.1.
  • [55] P. Ram and A. G. Gray (2011) Density estimation trees. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 627–635. Cited by: §1.1.
  • [56] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) Why should I trust you?: explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. Cited by: §1.
  • [57] C. Rudin (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1 (5), pp. 206–215. Cited by: §1.
  • [58] S. Saisubramanian, S. Galhotra, and S. Zilberstein (2020) Balancing the tradeoff between clustering value and interpretability. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp. 351–357. Cited by: §1.1.
  • [59] M. Schmidt, C. Schwiegelshohn, and C. Sohler (2019) Fair coresets and streaming algorithms for fair k-means. In International Workshop on Approximation and Online Algorithms, pp. 232–251. Cited by: §5.
  • [60] M. Schrynemackers, L. Wehenkel, M. M. Babu, and P. Geurts (2015) Classifying pairs with trees for supervised biological network inference. Molecular Biosystems 11 (8), pp. 2116–2125. Cited by: §1.1.
  • [61] Y. Yasami and S. P. Mozaffari (2010) A novel unsupervised classification approach for network anomaly detection by k-means clustering and ID3 decision tree learning methods. The Journal of Supercomputing 53 (1), pp. 231–245. Cited by: §1.1.

Appendix A Omitted proofs and extra theoretical results

Proof of Theorem 1.

Denote by the tree at iteration of the algorithm. We want to show that

We prove a stronger claim: for any possible split (not just for the one by ExKMC) the surrogate cost will not increase. Fix a cluster in and suppose that we split it into two clusters and that this results in the tree . For the split the algorithm chooses, we have .

We separate the surrogate cost into two terms: points that are in and those that are not

Importantly, the first term appears also in . Thus,

is equal to

This is at most because we can give and the same center in that was given to . ∎
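The monotonicity argument can be checked numerically on a toy example. The sketch below assumes that the surrogate cost of a tree is the sum, over its leaves, of the cost of assigning all points of the leaf to the single best reference center; the function names and the data are illustrative and are not taken from the ExKMC implementation.

```python
def sq_dist(x, c):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def leaf_cost(points, centers):
    """Cost of a leaf when all of its points share the best single center."""
    if not points:
        return 0.0
    return min(sum(sq_dist(x, c) for x in points) for c in centers)


def surrogate_cost(leaves, centers):
    """Assumed surrogate cost: sum of the per-leaf costs."""
    return sum(leaf_cost(points, centers) for points in leaves)


def split_leaf(points, feature, threshold):
    """Partition a leaf with a single-feature threshold cut."""
    left = [x for x in points if x[feature] <= threshold]
    right = [x for x in points if x[feature] > threshold]
    return left, right


# Toy data: one leaf that mixes points from two reference clusters.
centers = [(0.0, 0.0), (10.0, 0.0)]
leaf = [(0.0, 1.0), (1.0, 0.0), (9.0, 0.0), (10.0, 1.0)]

before = surrogate_cost([leaf], centers)
left, right = split_leaf(leaf, feature=0, threshold=5.0)
after = surrogate_cost([left, right], centers)

# Both children may reuse the parent's center, so the cost cannot go up.
assert after <= before
```

The same inequality holds for every cut of the leaf, not only the one the algorithm would pick, which mirrors the stronger claim used in the proof.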

Proof of Theorem 2.

Once defines a clustering that is a refinement, we know that for any cluster each points will be assigned the same center, i.e., for all , the value is the same. Thus,

which is exactly equal to the -means reference clustering. ∎

Corollary 2.

If ExKMC builds a tree with n leaves, where n is the number of data points, then it exactly matches the reference clustering, and the surrogate cost is equal to the reference cost.


If there is a step such that is a refinement of , then, using Theorem 2, the corollary holds. Using Theorem 1, the tree will continue to expand until either it is a refinement of or there are no more points and the tree contains leaves. A clustering defined by leaves is one where each point is in a cluster of its own, which is a refinement of More specifically, for the tree that contains leaves, we see that the surrogate cost is equal to

This is exactly equal to the -means reference clustering. ∎
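A small numerical illustration of the extreme case in the corollary, under the same assumed definition of the surrogate cost (each leaf commits to one reference center): once every point sits in its own leaf, the surrogate cost coincides with the cost of the reference clustering. All names and values below are hypothetical.

```python
def sq_dist(x, c):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def reference_cost(points, centers):
    """k-means cost when every point uses its closest reference center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in points)


def surrogate_cost(leaves, centers):
    """Assumed surrogate cost: each leaf commits to one reference center."""
    return sum(
        min(sum(sq_dist(x, c) for x in leaf) for c in centers)
        for leaf in leaves
    )


centers = [(0.0, 0.0), (4.0, 4.0)]
points = [(0.0, 1.0), (1.0, 1.0), (3.0, 4.0), (5.0, 5.0)]

one_leaf = [points]                       # tree with a single leaf
singleton_leaves = [[x] for x in points]  # tree with one leaf per point

# A single leaf can only be worse than the reference clustering ...
assert surrogate_cost(one_leaf, centers) >= reference_cost(points, centers)
# ... while one leaf per point matches it exactly.
assert surrogate_cost(singleton_leaves, centers) == reference_cost(points, centers)
```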

Proof of Theorem 3.

Denote by the output threshold tree of the IMM algorithm and by the output threshold tree of Algorithm 1. We want to show that

The proof of Theorem 3 in [21] shows that Together with Theorem 1, we know that the surrogate cost is non-increasing, thus

As the surrogate cost upper bounds the -means cost, we know that

A.1 Hard-to-explain dataset

In [21] a dataset was shown that cannot have an -approximation with any threshold tree that has leaves. In this section we will show that there is a tree with only leaves that achieves the optimal clustering, even on this difficult dataset. Moreover, the ExKMC algorithm outputs such a threshold tree.

It will be useful later on to analyze the pairwise distances between the codewords. Using Hoeffding’s inequality we prove that the distances are very close to

Proposition 1 (Hoeffding’s inequality).


be independent random variables, where for each

, . Define the random variable Then, for any

Claim 1.

Let . With probability at least over the choice of random vectors in , all the pairwise distances between the vectors are


The average distance between any two random vectors is From Hoeffding’s inequality and union bound, the probability that there is a pair with distance that deviates by more than is bounded by
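The concentration of pairwise distances can be observed empirically. The sketch below assumes, purely for illustration, that the codewords are drawn uniformly from {-1, +1}^d, so that the squared distance between two codewords is four times the number of coordinates where they differ and has expectation 2d. The deviation parameter `eps`, the helper names, and the constants in the bound follow from this assumed model, not from the original statement.

```python
import math
import random
from itertools import combinations


def random_codewords(k, d, rng):
    """Draw k vectors uniformly from {-1, +1}^d (assumed codeword model)."""
    return [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(k)]


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def hoeffding_union_bound(k, d, eps):
    """Upper bound on P(some pair deviates from 2d by at least eps * d).

    A squared distance is a sum of d independent terms in {0, 4}, so
    Hoeffding gives 2 * exp(-2 t^2 / (16 d)) per pair with t = eps * d,
    and a union bound multiplies this by the number of pairs.
    """
    pairs = k * (k - 1) // 2
    return min(1.0, pairs * 2.0 * math.exp(-(eps ** 2) * d / 8.0))


rng = random.Random(0)
k, d, eps = 30, 1000, 0.5
codewords = random_codewords(k, d, rng)
distances = [sq_dist(u, v) for u, v in combinations(codewords, 2)]

max_deviation = max(abs(dist - 2 * d) for dist in distances)
failure_bound = hoeffding_union_bound(k, d, eps)
print(f"largest deviation from 2d: {max_deviation}")
print(f"union bound on failure probability: {failure_bound:.2e}")
```

With k = 30 and d = 1000, the sizes listed for Synthetic II, the union bound is already tiny for a constant `eps`, which is the regime the claim relies on.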

For the analysis of ExKMC on dataset we will use k-means++ to find the reference centers. For completeness, we write its pseudo-code and introduce the notion of distance of point from a set of points as

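Input: X – set of vectors in R^d
       k – number of centers
Output: M – set of centers
M ← ∅
while |M| < k do
      sample x ∈ X with probability proportional to its squared distance from M (uniformly when M = ∅) and add x to M
run Lloyd’s algorithm with centers M
return M
Algorithm 3 k-means++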

In Claim 2 we will show that one iteration of Lloyd’s algorithm is enough for convergence. Summarizing, we consider the following form of our algorithm, which uses k-means++ to find the reference centers and the IMM algorithm for the base tree:

  1. Use the -means++ sampling procedure to choose centers from the dataset. Letting denote the centers at step , we add a new center by sampling a point with probability

  2. Use one iteration of Lloyd’s algorithm, where we first recluster based on the centers in , then take the means of the new clusters as the reference centers.

  3. Run the IMM algorithm with reference centers to construct a base tree with leaves.

  4. Expand the tree to leaves using ExKMC with reference centers and the IMM tree.
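The first two steps above can be sketched in plain Python as follows. This is a minimal illustration of k-means++ seeding followed by one Lloyd iteration, not the code used in the experiments; steps 3 and 4 (IMM and the ExKMC expansion) are omitted, and all function names and the toy data are hypothetical.

```python
import random


def sq_dist(x, c):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def dist_to_set(x, centers):
    """Squared distance from x to the closest center of a non-empty set."""
    return min(sq_dist(x, c) for c in centers)


def kmeans_cost(points, centers):
    """k-means cost of the points with respect to the given centers."""
    return sum(dist_to_set(x, centers) for x in points)


def kmeanspp_seeding(points, k, rng):
    """Step 1: first center uniformly at random, the rest by D^2 sampling.

    Assumes the data contains at least k distinct points; otherwise all
    weights become zero and random.choices raises an error.
    """
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [dist_to_set(x, centers) for x in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def lloyd_iteration(points, centers):
    """Step 2: recluster by closest center, then take the cluster means."""
    clusters = [[] for _ in centers]
    for x in points:
        j = min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
        clusters[j].append(x)
    new_centers = []
    for cluster, old in zip(clusters, centers):
        if not cluster:  # keep the old center if its cluster is empty
            new_centers.append(old)
            continue
        dim = len(cluster[0])
        mean = tuple(sum(x[i] for x in cluster) / len(cluster) for i in range(dim))
        new_centers.append(mean)
    return new_centers


rng = random.Random(0)
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (20.0, 0.0), (20.0, 1.0), (21.0, 0.0),
]
seeds = kmeanspp_seeding(points, k=3, rng=rng)
reference_centers = lloyd_iteration(points, seeds)
```

Because an already chosen point has weight zero, the seeding never selects the same point twice, and a Lloyd iteration never increases the k-means cost.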

We first prove that after using the k-means++ initialization and one iteration of Lloyd’s algorithm, the resulting centers are the optimal centers, i.e., the codewords.

Claim 2.

When running -means++ on , with probability at least the resulting centers are the optimal ones.


We will show that with probability at least , after the initialization step, the k-means++ algorithm took one point from each cluster. This will imply that the optimal clusters have been found, and the optimal centers will be returned after the next step in the k-means++ algorithm, one iteration of Lloyd’s algorithm.

We prove by induction on that with probability at most , the variable does not contain points from different codewords, where is a lower bound to the codewords pairwise distances. The base of the induction is trivial. Suppose that , where and contains points from The distance from of each point that its cluster was taken already, , is at most . There are at most such points. The distance of points not taken is at least . There are at least such points. Thus, using union bound, the probability to take a point from is at most

We use the induction hypothesis and show that in total the probability to return points not all from different clusters is at most

Specifically, at the end, the error probability is bounded by

In the next claim we show that if we run IMM with the codewords as the reference centers, then the cost is bounded by The IMM algorithm can get the codewords as centers by running the -means++ algorithm and use the previous claim.

Claim 3.

The output threshold tree of the IMM algorithm on with the codewords as the set of reference centers, has -means cost


The work [21] defines the mistakes of an inner node as follows: If defines a split using feature  and threshold , a mistake is a point such that and its center both reach but then they become separated and go different directions, i.e.,

The cost of the IMM tree is composed of points that are mistakes and those that are not. All the points that are not mistakes contribute in total to the cost. Each point that is a mistake contributes at most to the cost. We will show that there are only mistakes and this completes the proof.

We consider each level of a tree, where a level is a set of nodes with the same number of edges on the path from the root. Due to the definition of the IMM algorithm, each center survives in at most one node in any level. Each center at a node can cause at most one mistake, the point in with a zero in the -th coordinate. Thus at each level there are only mistakes. From [21], Section B.5, we know that the depth is Thus, the total number of mistakes in the IMM tree is and the total cost is
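The notion of a mistake used in this proof can be made concrete with a short sketch. It counts, for a single cut, the points that are separated from their own center. The example assumes a cluster built from a codeword in {-1, +1}^d together with copies of it that have one coordinate set to zero; that construction and all names are illustrative assumptions.

```python
def goes_left(v, feature, threshold):
    """True if vector v falls on the left side of the threshold cut."""
    return v[feature] <= threshold


def count_mistakes(points, assigned_centers, feature, threshold):
    """Number of points separated from their own center by the cut.

    `points` and `assigned_centers` are parallel lists, restricted to the
    points whose center also reaches the node being split.
    """
    return sum(
        goes_left(x, feature, threshold) != goes_left(c, feature, threshold)
        for x, c in zip(points, assigned_centers)
    )


# Assumed cluster: a codeword and its copies with one coordinate zeroed.
codeword = (1, -1, 1)
cluster = [(0, -1, 1), (1, 0, 1), (1, -1, 0)]
centers = [codeword] * len(cluster)

# Cutting on feature 0 only separates the copy whose coordinate 0 is zero.
mistakes_feature_0 = count_mistakes(cluster, centers, feature=0, threshold=0.5)
# Cutting on feature 1 only separates the copy whose coordinate 1 is zero.
mistakes_feature_1 = count_mistakes(cluster, centers, feature=1, threshold=-0.5)
```

In this toy cluster each cut separates at most one point from the codeword, which is the per-center bound the level-by-level counting argument uses.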

Putting these results together, we are now ready to prove that our algorithm produces the optimal clustering on the Synthetic II dataset using a tree with leaves.

Proof of Theorem 4.

We generalize the notion of a mistake introduced in [21]. A mistake in a leaf is a point in the leaf that its center is not , the labeling of . Note that the number of mistakes in IMM upper bounds the number of mistakes in the leaves at the beginning of ExKMC run, which implies that at the beginning there are only mistakes in the leaves.

We will show that at each iteration of ExKMC, either there are no mistakes in the leaves or there is a split that will not incur more mistakes and actually reduce the number of mistakes. We will show that this will imply that the surrogate cost decreases by , for some constant . A split that does not change the number of mistakes, or even increase it will have a smaller gain. Thus, number of mistakes will guarantee to decrease. This will immediately prove that after iterations of ExKMC the clustering is the one defined by We want to prove something stronger, we want to analyze the trade-off between complexity (number of leaves added) and accuracy (the surrogate cost).

Let us analyze the approximation ratio as a function of the number of leaves added to the tree. If there are no mistakes, then we have found the optimal clustering. Otherwise, the total surrogate cost after IMM tree is , for some constant , and each new leaf will decrease the cost by . Together with the optimal cost being , for some constant , we deduce that if there are leaves (equivalently, iterations of ExKMC) and there are still mistakes, the approximation ratio is bounded by

In different words, there is a constant such that the approximation ratio is bounded by

Let us prove that if there is a mistake in a leaf, then some split decreases the surrogate cost by If there is leaf with a point that is a mistake, then does not belong to the codeword There is a feature where and (actually there are such coordinates, and any one of them will be good). Without loss of generality, assume that . Focus on the split with Denote by the partition of points in defined by this split. Suppose that Denote by all the points in that belong to the cluster This choice of threshold ensures that all points in are in and not in . This means that all points in are now in a different cluster than after this split. In different words, all points that were not mistakes as they were in will remain as non-mistakes. Furthermore, this split will make a non-mistake. In there are at most points (because only mistakes can be in ). Using Claim 1, their cost from changing the center can increase by at most , and they decrease the surrogate cost by Thus the decrease in this split is at least As the ExKMC algorithm chooses the split minimizing the surrogate cost, the cost decreases by in this step.

The last thing we need to show is that for every split with gain, the number of mistakes must go down. Suppose the algorithm made a split at node where previously there were points and mistakes. After the split there are nodes on the right successor and on the left, and mistakes on the right and mistakes on the left.

Letting , the cost of each node is, by Claim 1,

Thus, the maximal gain of a split is achieved in case has the largest cost and its successors the smallest. Specifically, the maximal gain is

The last term is equal to

For the gain to be , or even positive, the number of mistakes must decrease. ∎

Appendix B More Experimental Details

B.1 Algorithms and Baselines

Our empirical evaluation compared the following clustering methods.

  • CART [15]: Each data point was assigned a class label based on the clustering result of the near-optimal baseline. We used the sklearn decision tree implementation, minimizing Gini impurity. The number of leaves was controlled by the max_leaf_nodes parameter.

  • KDTree [9]: For each tree node the best cut was chosen by taking the coordinate with highest variance, and split according to the median threshold. The size of the constructed tree is controlled through leaf_size parameter, since the splits are always balanced we obtained a tree with up to leaves by constructing a KDTree with . Each of the leaves was labeled with a cluster id from to , such the will be minimized with the centers of the near-optimal baseline.

  • CUBT [28]: Constructed clustering tree using CUBT R package [32]. CUBT algorithm is composed out of three steps: (1) build a large tree, (2) prune branches, and (3) join tree leaves labels. Each step is controlled by multiple parameters. For each step we applied grid-search over multiple parameters, and selected the parameters that minimize the final tree . The hyper-parameters were chosen based on [28] recommendation, and available in B.3. To construct a tree with leaves the we have set in the prune step, and to verify that only clusters will be constructed we have set in the join step.

  • CLTree [43]: Constructed clustering tree using CLTree Python package [19]. CLTree algorithm first construct a large tree with up to min_split samples in each leaf, and afterwards prune its branches. Pruning step is controlled by min_y and min_rd parameters. We applied grid-search over those parameters (values specified in B.3), for each combination we counted the number of leaves, and for each number of leaves we have taken the tree with minimal .

  • ExKMC: Applied our proposed method for tree construction (Algorithm 1), that minimize at each split with the centers of the near-optimal baseline.

  • ExKMC (base: IMM): Constructed a base tree with leaves according to the IMM algorithm [21]. The tree was expanded with our proposed method (Algorithm 1) that minimize at each split. Both IMM and the ExKMC method used the centers of the near-optimal baseline.

For each combination of parameters, if execution time was more than 1 hour, the execution was terminated and ignored. Execution termination occurred only with CUBT and CLTree over the larger datasets.
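To make the KDTree baseline from the list above concrete, here is a minimal plain-Python sketch of the described rule: cut on the coordinate with the highest variance at its median, then label each leaf with the reference center that gives it the lowest cost. It is an illustration under these stated assumptions rather than the code used in the experiments; the `leaf_size` argument only mirrors the parameter name mentioned above, and the data is made up.

```python
from statistics import median, pvariance


def sq_dist(x, c):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def kd_leaves(points, leaf_size):
    """Recursively cut on the highest-variance coordinate at its median."""
    if len(points) <= leaf_size:
        return [points]
    dim = len(points[0])
    feature = max(range(dim), key=lambda i: pvariance([x[i] for x in points]))
    threshold = median(x[feature] for x in points)
    left = [x for x in points if x[feature] <= threshold]
    right = [x for x in points if x[feature] > threshold]
    if not left or not right:  # the cut does not separate anything: stop
        return [points]
    return kd_leaves(left, leaf_size) + kd_leaves(right, leaf_size)


def label_leaf(leaf, centers):
    """Index of the reference center that minimizes the cost of the leaf."""
    return min(
        range(len(centers)),
        key=lambda j: sum(sq_dist(x, centers[j]) for x in leaf),
    )


# Toy data: two groups of four points and their reference centers.
points = [
    (0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
    (10.0, 0.0), (11.0, 0.0), (10.0, 1.0), (11.0, 1.0),
]
reference_centers = [(0.5, 0.5), (10.5, 0.5)]

leaves = kd_leaves(points, leaf_size=4)
labels = [label_leaf(leaf, reference_centers) for leaf in leaves]
```

Since the cuts ignore the reference centers, a leaf can mix points from several clusters; only the labeling step uses the centers.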

B.2 Datasets

Datasets in the empirical evaluation are depicted in Table 1.

Dataset               Clusters (k)   Points (n)   Features (d)
Small Datasets
Iris [27]             3              150          4
Wine [26]             3              178          13
Breast Cancer [26]    2              569          30
Digits [41]           10             1,797        64
Mice Protein [33]     8              1,080        69
Anuran Calls [49]     10             7,195        22
Larger Datasets
Avila [24]            12             20,867       10
Covtype [12]          7              581,012      54
20 Newsgroups [36]    20             18,846       1,893
CIFAR-10 [40]         10             50,000       3,072
Synthetic Datasets
Synthetic I           3              5,000        1,000
Synthetic II          30             30,000       1,000
Table 1: Datasets properties

Categorical values and class labels were removed from the datasets. The number of clusters, k, was set to be equal to the number of class labels. For the 20newsgroups dataset, documents were converted to vectors by removing English stop-words and constructing word count vectors, ignoring rare terms (with document frequency smaller than ).
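The text preprocessing described above can be sketched as follows. The stop-word list, the tokenizer, and the `min_df` cutoff (expressed here as a minimum number of documents) are placeholders chosen for the example; they are not the settings used for the 20newsgroups experiments.

```python
from collections import Counter

# Tiny illustrative stop-word list, not a full English list.
STOP_WORDS = {"the", "a", "of", "and", "is"}


def tokenize(doc):
    """Lowercase, split on whitespace, and drop stop-words."""
    return [w for w in doc.lower().split() if w not in STOP_WORDS]


def count_vectors(docs, min_df):
    """Word-count vectors over terms that occur in at least min_df documents."""
    tokenized = [tokenize(doc) for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(term for term, df in doc_freq.items() if df >= min_df)
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for tokens in tokenized:
        row = [0] * len(vocab)
        for term, count in Counter(tokens).items():
            if term in index:  # rare terms are ignored
                row[index[term]] = count
        vectors.append(row)
    return vocab, vectors


docs = ["the cat and the dog", "a dog is a dog", "the bird"]
vocab, vectors = count_vectors(docs, min_df=2)
```

Each resulting coordinate is a single word count, so a threshold cut in the tree reads as "the word appears more than t times".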

Synthetic I dataset.  The Synthetic I dataset is a slight adaptation of the one described in [21], which was designed to highlight the weakness of the CART algorithm. It contains points:

  • Two of them are and where is a large number.

  • Half of the remaining points are in the first feature and another random features are also , all the remaining features are

  • The remaining points also have zero in the first feature, and the other random features are set to  and the rest of the features are .

Properties of synthetic II are described in Section A.1.

B.3 Hyper-parameters

Grid search was executed on the following hyper-parameters values:

  • CUBT:

    To construct a tree with leaves the we have set in the prune step, and to verify that only clusters will be constructed we have set in the join step.

  • CLTree:

B.4 Extra experiments

Figure 4 depicts the running time of ExKMC constructing a tree with leaves using a single process. The operations of IMM and ExKMC can be parallelized feature-wise, and the KMeans iterations can also be executed in parallel. Figure 6 compares the running times with a single process and with four processes. Using four processes, ExKMC constructs a tree with leaves over the CIFAR-10 dataset in less than minutes.
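Feature-wise parallelism can be sketched as follows: the best threshold is searched independently for every feature, and the cheapest cut over all features is kept. This is a hedged illustration, not the ExKMC implementation. It uses a thread pool for portability, whereas a real speed-up for CPU-bound Python code would typically need separate processes or native code; the cost function is the assumed per-leaf surrogate cost, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def sq_dist(x, c):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def leaf_cost(points, centers):
    """Cost of a leaf when all of its points share the best single center."""
    if not points:
        return 0.0
    return min(sum(sq_dist(x, c) for x in points) for c in centers)


def best_cut_for_feature(points, centers, feature):
    """Best threshold on one feature, as a (cost, feature, threshold) triple."""
    best = (float("inf"), feature, None)
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted({x[feature] for x in points})[:-1]:
        left = [x for x in points if x[feature] <= threshold]
        right = [x for x in points if x[feature] > threshold]
        cost = leaf_cost(left, centers) + leaf_cost(right, centers)
        if cost < best[0]:
            best = (cost, feature, threshold)
    return best


def best_cut(points, centers, workers=4):
    """Evaluate all features independently, then keep the cheapest cut."""
    dim = len(points[0])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(
            pool.map(lambda f: best_cut_for_feature(points, centers, f), range(dim))
        )
    return min(results, key=lambda r: r[0])


centers = [(0.0, 0.0), (10.0, 0.0)]
points = [(0.0, 1.0), (1.0, 0.0), (9.0, 0.0), (10.0, 1.0)]
cost, feature, threshold = best_cut(points, centers)
```

The per-feature searches share no state, so the result does not depend on the number of workers.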

Figure 6: Running times of constructing a tree-based clustering with leaves. We consider both a single process (left, panel (a)) and a parallel version using four processes (right, panel (b)). The y-axis labels differ between the left and right graphs. Overall, parallelism improves the running time of ExKMC by a factor of 2–3.
Figure 7: CIFAR-10 Convergence Rate. We separately showcase CIFAR-10 because it is an exception to many trends. The algorithms fail to converge to a cost ratio of 1.0 compared to the reference clustering. This is likely because pixels are not good features in this context, since the trees only use a subset of them, one at a time. We also notice that IMM starts with the worst performance, but when expanded using ExKMC, eventually outperforms the competitors (for ).
Figure 8: Results of small datasets presented in Figure 2 where is in range . Panels: (a) Iris, (b) Wine, (c) Breast Cancer, (d) Digits, (e) Mice Protein, (f) Anuran.