
# How to Find a Good Explanation for Clustering?

k-means and k-median clustering are powerful unsupervised machine learning techniques. However, due to complicated dependencies on all the features, it is challenging to interpret the resulting cluster assignments. Moshkovitz, Dasgupta, Rashtchian, and Frost [ICML 2020] proposed an elegant model of explainable k-means and k-median clustering. In this model, a decision tree with k leaves provides a straightforward characterization of the data set into clusters. We study two natural algorithmic questions about explainable clustering. (1) For a given clustering, how can one find the "best explanation" by using a decision tree with k leaves? (2) For a given set of points, how can one find a decision tree with k leaves minimizing the k-means/median objective of the resulting explainable clustering? To address the first question, we introduce a new model of explainable clustering. Our model, inspired by the notion of outliers in robust statistics, is the following: we seek a small number of points (outliers) whose removal makes the existing clustering well-explainable. To address the second question, we initiate the study of the model of Moshkovitz et al. from the perspective of multivariate complexity. Our rigorous algorithmic analysis sheds some light on the influence of parameters like the input size, the dimension of the data, the number of outliers, the number of clusters, and the approximation ratio on the computational complexity of explainable clustering.


## 1 Introduction

Interpretation or explanation of decisions produced by learning models, including clustering, is a significant direction in machine learning (ML) and artificial intelligence (AI), and has given rise to the subfield of Explainable AI. Explainable AI has attracted a lot of attention from researchers in recent years (see the surveys by Carvalho et al. [5] and Marcinkevičs and Vogt [34]). All these works can be divided into two main categories: pre-modelling [43, 42, 23, 15, 30] and post-modelling [38, 40, 4, 41, 31] explainability. While post-modelling explainability focuses on giving reasoning behind decisions made by black-box models, pre-modelling explainability deals with ML systems that are inherently understandable or perceivable by humans. One of the canonical approaches to pre-modelling explainability builds on decision trees [35, 37]. In fact, a significant amount of work on explainable clustering is based on unsupervised decision trees [3, 17, 20, 21, 29, 36]. In each node of the decision tree, the data is partitioned according to a threshold value of some feature. While such a threshold tree provides a clear interpretation of the resulting clustering, its cost measured by the standard k-means/median objective can be significantly worse than the cost of the optimal clustering. Thus, on the one hand, the efficient algorithms developed for k-means/median clustering [1] are often challenging to explain. On the other hand, the easily explainable models could output very costly clusterings. Subsequently, Moshkovitz et al. [36], in a fundamental work, posed the natural algorithmic question of whether it is possible to kill two birds with one stone. To be precise, is it possible to design an efficient procedure for clustering that

• Is explainable by a small decision tree; and

• Does not cost significantly more than the cost of an optimal k-means/median clustering?

To address this question, Moshkovitz et al. [36] introduced explainable k-means/median clustering. In this scheme, a clustering is represented by a binary (threshold) tree whose leaves correspond to clusters, and each internal node corresponds to partitioning a collection of points by a threshold on a fixed coordinate. Thus, the number of leaves in such a tree is k, the number of clusters sought. Also, any cluster assignment can be explained by the thresholds along the corresponding root-leaf path. For example, consider Fig. 1, which shows an optimal 5-means clustering of a 2D data set, an explainable 5-means clustering of the same data set, and the threshold tree inducing the explainable clustering. The tree has five leaves, corresponding to 5 clusters. Note that in this model of explainability, any clustering has a clear geometric interpretation, where each cluster is formed by a set of axis-aligned cuts defined by the tree. As Moshkovitz et al. argue, the classical k-means clustering algorithm leads to more complicated clusters, while the threshold tree leads to an easy explanation. The advantage of the explainable approach becomes even more evident in higher dimensions, when many feature values in k-means contribute to the formation of the clusters.

Moshkovitz et al. [36] define the quality of any explainable clustering as the "cost of explainability", that is, the ratio of the cost of the explainable clustering to the cost of an optimal clustering. Subsequently, they obtain efficient algorithms for computing explainable clusterings whose cost of explainability is O(k) for k-median and O(k²) for k-means. They also show that this ratio is at least Ω(log k) in both cases. Recently, a series of works has been dedicated to improving these bounds. In the low-dimensional setting, Laber and Murtinho [28] showed upper bounds depending on the dimension d for k-median and k-means, respectively. In general, Makarychev and Shan [33], Gamlath, Jia, Polak, and Svensson [19], and Esfandiari, Mirrokni, and Narayanan [14] independently showed an Õ(k) upper bound for k-means and an Õ(log k) upper bound for k-median, while also improving the lower bound for k-means to Ω̃(k). For low dimensions, this was improved by Charikar and Hu [7], who showed an upper bound of k^{1−2/d} · polylog(k) for k-means.

Our contributions. In this work, we propose a new model for explaining a clustering, called Clustering Explanation. Our approach to explainability is inspired by the research on robustness in statistics and machine learning, especially the vast field of outlier detection and removal in the context of clustering [9, 18, 16, 8, 6, 22, 26]. In this model, we are given a k-means/median clustering, and we would like to explain the clustering by a threshold tree after removing a small subset of points. To be precise, we are interested in finding a subset W of points (which are to be removed) and a threshold tree T such that the explainable clustering induced by the leaves of T is exactly the same as the given clustering after removing the points in W. For the given clustering, we define an optimal (or best) explainable clustering to be one that minimizes the size of W, i.e., one for which the given clustering can be explained by removing the minimum number of points. Thus, in Clustering Explanation, we measure the "explainability" as the number of outlying points whose removal turns the given clustering into an explainable clustering. The reasoning behind the new measure of cluster explainability is the following: in certain situations, we would be satisfied with a small decision tree explaining the clustering of all but a few outlying data points. We note that for a given clustering that is already explainable, i.e. can be explained by a threshold tree, the size of W is 0.

In Fig. 2, we provide an example of an optimal 5-means clustering of exactly the same data set as in Fig. 1. However, the new explainable clustering is obtained in a different way: if we remove a small number of points (in Fig. 2 these are the 9 larger red points), then the explainable clustering is the same as the optimal clustering after removing those 9 points.

We note that Clustering Explanation corresponds to the classical machine learning setting of interpreting a black-box model, i.e. it lies within the scope of post-modelling explainability. Surprisingly, this area is widely unexplored when it comes to rigorous algorithmic analysis of clustering explanation. Consequently, we study Clustering Explanation from the perspective of computational complexity. Our new model naturally raises the following algorithmic questions: (i) given a clustering, how efficiently can one decide whether the clustering can be explained by a threshold tree (without removing any points)? and (ii) given a clustering and an integer J, how efficiently can one decide whether the clustering can be explained by removing at most J points?

In our work, we design a polynomial-time algorithm that resolves the first question. Regarding the second question, we give an algorithm that decides whether a given k-clustering of n points in ℝ^d can be explained by removing at most J points; its running time is of the form f(k, J) · n^{O(d)} for a computable function f. We also give a polynomial-time (k − 1)-approximation algorithm for Clustering Explanation. That is, we give a polynomial-time algorithm that returns a solution set of at most (k − 1) · OPT points that are to be removed, where OPT is the number of points removed by a best explainable clustering. Moreover, we provide an efficient data reduction procedure that reduces an instance of Clustering Explanation to an equivalent instance whose number of points and range of integer coordinates are bounded by polynomials of k and J. The procedure can be used to speed up any algorithm for Clustering Explanation, as long as n is large compared to k and J. We complement our algorithms by showing a hardness lower bound. In particular, we show that Clustering Explanation cannot be approximated within a factor of g(d) in time f(d) · n^{o(d)}, for any computable functions f and g, unless the Exponential Time Hypothesis (ETH) [25] fails. All these results appear in Section 3.

We also provide new insight into the computational complexity of the model of Moshkovitz et al. [36]. While the vanilla k-median and k-means problems are NP-hard already for k = 2 [2, 13, 11] or d = 2 [32], this is not the case for explainable clustering! We design two simple algorithms computing an optimal (best) explainable clustering with the k-means/median objective that run in time n^{O(k)} and n^{O(d)}, respectively. Hence, for constant k or constant d, an optimal explainable clustering can be computed in polynomial time. The research on approximation algorithms for the "cost of explainability" in [36, 7, 14, 19, 28, 33] implicitly assumes that solving the problem exactly is NP-hard. However, we did not find a proof of this fact in the literature. To fill this gap, we obtain the following hardness lower bound: an optimal explainable clustering cannot be found in time f(k) · n^{o(k)} for any computable function f, unless the Exponential Time Hypothesis (ETH) fails. This lower bound demonstrates that asymptotically the running times of our simple algorithms are unlikely to be improved. Our reduction also yields that the problem is NP-hard. These results are described in Section 4.

Finally, we combine the above two explainability models to obtain the Approximate Explainable Clustering model: for a collection of points in ℝ^d and a positive real constant ε, we ask whether we can identify a small number of outliers, controlled by ε, such that the cost of an explainable k-means/median clustering of the remaining points does not exceed the optimal cost of an explainable k-means/median clustering of the original data set. Thus, if we are allowed to remove a small number of points, can we do as well as any optimal solution for the original data set? While our hardness result of Section 4 holds for explaining the whole data set, by "sacrificing" a small fraction of the points it might be possible to solve the problem more efficiently. And indeed, for this model, we obtain an algorithm whose running time has a significantly better dependence on k and d than the time bounds of the exact algorithms above. This algorithm appears in Section 5. See Table 1 for a summary of all our results.

## 2 Preliminaries

#### k-means/median.

Given a collection X of n points in ℝ^d and a positive integer k, the task of k-clustering is to partition X into k parts X_1, …, X_k, called clusters, such that the cost of the clustering is minimized. We follow the convention of the previous work [36] for defining the cost. In particular, for k-means we consider the Euclidean distance, and for k-median the Manhattan distance. For a collection of points X′ ⊆ ℝ^d, we define

 cost₂(X′) = min_{c ∈ ℝ^d} ∑_{x ∈ X′} ‖c − x‖₂²,  (1)

and call the point c minimizing the sum in (1) the mean of X′. For a clustering {X_1, …, X_k} of X, its k-means (or simply means) cost is ∑_{i=1}^{k} cost₂(X_i). With respect to the Manhattan distance, we define analogously cost₁(X′), which is minimized at the (coordinate-wise) median of X′, and ∑_{i=1}^{k} cost₁(X_i), which we call the k-median (or simply median) cost of the clustering.
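
The two objectives can be evaluated directly from these definitions. The sketch below (our own illustration with hypothetical function names, not code from the paper) uses the facts stated above: the sum in (1) is minimized at the mean, and its Manhattan analogue at the coordinate-wise median.

```python
# Minimal sketch of the two cluster costs. Points are tuples of coordinates.
from statistics import median

def cost2(points):
    """k-means cost of one cluster: sum of squared Euclidean distances to the mean."""
    d, n = len(points[0]), len(points)
    mean = [sum(p[i] for p in points) / n for i in range(d)]
    return sum(sum((p[i] - mean[i]) ** 2 for i in range(d)) for p in points)

def cost1(points):
    """k-median cost of one cluster: sum of Manhattan distances to the coordinate-wise median."""
    d = len(points[0])
    med = [median(p[i] for p in points) for i in range(d)]
    return sum(sum(abs(p[i] - med[i]) for i in range(d)) for p in points)
```

The total cost of a clustering is then the sum of these values over its clusters.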

#### Explainable clustering.

For a vector x ∈ ℝ^d, we use x[i] to denote the i-th element (coordinate) of the vector for i ∈ {1, …, d}. Let X be a collection of points of ℝ^d. For i ∈ {1, …, d} and θ ∈ ℝ, we define the (i, θ)-cut of X as the pair (X_1, X_2), where (X_1, X_2) is a partition of X with

 X_1 = {x ∈ X ∣ x[i] ≤ θ}  and  X_2 = {x ∈ X ∣ x[i] > θ}.

Then, given a collection X and a positive integer k, we cluster X as follows. If k = 1, then X is the unique cluster. If k = 2, then we choose i ∈ {1, …, d} and θ ∈ ℝ and construct two clusters X_1 and X_2, where (X_1, X_2) is the (i, θ)-cut of X. For k > 2, we select i ∈ {1, …, d} and θ ∈ ℝ, and construct the partition (X_1, X_2) of X. Then the clustering of X is defined recursively as the union of a k_1-clustering of X_1 and a k_2-clustering of X_2 for some positive integers k_1 and k_2 such that k_1 + k_2 = k. We say that a clustering is an explainable k-clustering of a collection of points X if it can be constructed by the described procedure.

#### Threshold tree.

It is useful to represent an explainable k-clustering as a triple (T, i, θ), called a threshold tree, where T is a rooted binary tree with k leaves in which each nonleaf node has two children, called left and right, respectively, and i: V(T) → {1, …, d} and θ: V(T) → ℝ, where V(T) is the set of nonleaf nodes of T. For each node v of T, we compute a collection of points X_v. For the root r, X_r = X. Let v be a nonleaf node of T, let v_1 and v_2 be its left and right children, respectively, and assume that X_v is constructed. We compute the (i(v), θ(v))-cut (X_1, X_2) of X_v and set X_{v_1} = X_1 and X_{v_2} = X_2. If v is a leaf, then X_v is a cluster. A clustering {X_1, …, X_k} is an explainable k-clustering of a collection of points X if there is a threshold tree (T, i, θ) such that X_1, …, X_k are the clusters corresponding to the leaves of T. Note that T is a full binary tree with k leaves, and the total number of such trees is the (k − 1)-st Catalan number, which is upper bounded by 4^k.
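
The Catalan bound can be checked numerically; the short sketch below is our own illustration, not part of the paper's algorithms.

```python
# The number of full binary trees with k leaves is the (k-1)-st Catalan number,
# which is indeed bounded by 4^k.
from math import comb

def catalan(m):
    """The m-th Catalan number C(2m, m) / (m + 1)."""
    return comb(2 * m, m) // (m + 1)

for k in range(1, 25):
    trees = catalan(k - 1)   # full binary trees with k leaves
    assert trees <= 4 ** k
```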

For a collection of points X and i ∈ {1, …, d}, we denote by X[i] the set of distinct values of the i-th coordinates of the points of X. It is easy to observe that in the construction of a threshold tree for a set of points X, it is sufficient to consider cuts (i, θ) with θ ∈ X[i]; we call such values of θ and such cuts canonical. We say that a threshold tree (T, i, θ) for a collection of points X is canonical if for every nonleaf node v, θ(v) ∈ X_v[i(v)], where X_v is the collection of points associated with v. Throughout the paper we consider only canonical threshold trees.
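
These definitions can be sketched in code as follows (illustrative code with assumed names and data layout, not from the paper): a point descends to the left child of an internal node iff x[i] ≤ θ, and the canonical cuts are exactly the coordinate values occurring in the data.

```python
# Threshold trees over points given as tuples. A leaf is ('leaf', label);
# an internal node is ('node', i, theta, left, right).

def canonical_cuts(X):
    """All canonical cuts (i, theta): theta is a value of coordinate i in X."""
    d = len(X[0])
    return [(i, theta) for i in range(d) for theta in sorted({x[i] for x in X})]

def assign(tree, x):
    """Return the label of the leaf (cluster) reached by point x."""
    while tree[0] == 'node':
        _, i, theta, left, right = tree
        tree = left if x[i] <= theta else right
    return tree[1]
```

For example, the tree `('node', 0, 1, ('leaf', 'A'), ('leaf', 'B'))` sends every point with first coordinate at most 1 to cluster 'A' and all other points to cluster 'B'.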

#### Parameterized complexity and ETH.

A parameterized problem is a subset Q of Σ* × ℕ, where Σ* is the set of strings over a finite alphabet Σ. Thus, an instance of Q is a pair (x, k), where x ∈ Σ* and k is a nonnegative integer called a parameter. A parameterized problem is said to be fixed-parameter tractable (FPT) if it can be solved in time f(k) · |x|^{O(1)} for some computable function f. Parameterized complexity theory also provides tools to refute the existence of an FPT algorithm for a parameterized problem. The standard way is to show that the considered problem is hard in one of the parameterized complexity classes W[1] or W[2]. We refer to the book [10] for the formal definitions of the parameterized complexity classes. The basic complexity assumption of the theory is that FPT ≠ W[1], where FPT is the class formed by all fixed-parameter tractable problems. The hardness is proved by demonstrating a parameterized reduction from a problem known to be hard in the considered complexity class. A parameterized reduction is a many-one reduction that takes an instance (x, k) of the first problem and in time f(k) · |x|^{O(1)} outputs an equivalent instance (x′, k′) of the second problem with k′ ≤ g(k), where f and g are computable functions. Another way to obtain lower bounds is to use the Exponential Time Hypothesis (ETH) formulated by Impagliazzo, Paturi, and Zane [24, 25]. For an integer p ≥ 3, let δ_p be the infimum of the real numbers c such that the p-Satisfiability problem can be solved in time O(2^{cn}), where n is the number of variables. The Exponential Time Hypothesis states that δ₃ > 0. In particular, ETH implies that 3-Satisfiability cannot be solved in time 2^{o(n)}.

## 3 Clustering Explanation

#### Clustering explanation.

In the Clustering Explanation problem, the input contains a k-clustering C = {C_1, …, C_k} of a collection of points X ⊆ ℝ^d and a nonnegative integer J, and the task is to decide whether there is a collection W ⊆ X of points with |W| ≤ J such that {C_1 ∖ W, …, C_k ∖ W} is an explainable k-clustering. Note that some clusters C_i ∖ W may be empty here.

### 3.1 A Polynomial-time (k−1)-Approximation

In the optimization version of Clustering Explanation, we are given a k-clustering C = {C_1, …, C_k} of a collection of points X in ℝ^d, and the goal is to find a minimum-sized subset W ⊆ X such that {C_1 ∖ W, …, C_k ∖ W} is an explainable clustering. In the following, we design an approximation algorithm for this problem based on a greedy scheme.

For any subset Y ⊆ X, let C_i^Y = C_i ∩ Y for i ∈ {1, …, k}. Also, for any subset Y ⊆ X, define the clustering induced by Y as C^Y = {C_1^Y, …, C_k^Y}. Denote by OPT(Y) the size of a minimum-sized subset W ⊆ Y such that the clustering C^{Y ∖ W} is explainable. First, we have the following simple observation, which follows trivially from the definition of OPT.

###### Observation 1.

For any subset Y ⊆ X, OPT(Y) ≤ OPT(X).

For any cut (i, θ), where i ∈ {1, …, d} and θ ∈ ℝ, let X_1 = {x ∈ X ∣ x[i] ≤ θ} and X_2 = {x ∈ X ∣ x[i] > θ}.

###### Lemma 1.

Consider any subset Y ⊆ X such that C^Y contains at least two non-empty clusters. It is possible to select a cut (i, θ), for i ∈ {1, …, d} and θ ∈ ℝ, and a subset W ⊆ Y, in polynomial time, such that (i) each cluster in C^{Y ∖ W} is fully contained in either X_1 or in X_2, (ii) at least one cluster in C^{Y ∖ W} is in X_1, (iii) at least one cluster in C^{Y ∖ W} is in X_2, and (iv) the size of W is at most OPT(Y).

Before we prove this lemma, we show how to use it to design the desired approximation algorithm.

#### The Algorithm.

We start with the set of all points X. We apply the algorithm of Lemma 1 with Y = X to find a cut (i, θ) and a subset W such that each cluster in C^{X ∖ W} is fully contained in either X_1 or in X_2. Let Z_1 = X_1 ∖ W and Z_2 = X_2 ∖ W. We recursively apply the above step on both Z_1 and Z_2 separately. If at some level the point set is a subset of a single cluster, we simply return.

The correctness of the above algorithm trivially follows from Lemma 1. In particular, the recursion tree of the algorithm gives rise to the desired threshold tree. Also, the algorithm runs in polynomial time, as each successful cut can be found in polynomial time and the algorithm finds only such cuts that separate the clusters. The last claim follows due to the properties (ii) and (iii) in Lemma 1.

Consider the threshold tree generated by the algorithm. For each internal node u, let X_u be the corresponding set of points and W_u be the points removed from X_u for finding an explainable clustering of the points in X_u. Note that we have at most k − 1 such nodes. The total number of points removed from X for finding the explainable clustering is ∑_u |W_u|. By Lemma 1,

 |Wu|≤OPT(Xu).

Now, as X_u ⊆ X, by Observation 1, OPT(X_u) ≤ OPT(X). It follows that

 ∑u|Wu|≤(k−1)⋅OPT(X).
###### Theorem 1.

There is a polynomial-time (k − 1)-approximation algorithm for the optimization version of Clustering Explanation.

By noting that OPT(X) = 0 if C is an explainable clustering, we obtain the following corollary.

###### Corollary 1.

Explainability of any given k-clustering in ℝ^d can be tested in polynomial time.

###### Proof of Lemma 1.

We probe all possible choices of cuts (i, θ) with i ∈ {1, …, d} and canonical θ, and select one that incurs the minimum cost. We also select a subset W of points to be removed w.r.t. each cut. The cost of such a cut is exactly the size of W.

Fix a cut (i, θ). We have the following three cases. In the first case, for all clusters in C^Y, strictly more than half of the points are contained in X_1. In this case, select a cluster C which has the minimum intersection with X_1. Put all the points in C ∩ X_1 into W. Also, for any other cluster C′, put the points in C′ ∩ X_2 into W. The second case is symmetric to the first one: for all clusters in C^Y, strictly more than half of the points are contained in X_2. In this case, we again select a cluster C which has the minimum intersection with X_2. Put all the points in C ∩ X_2 into W. Also, for any other cluster C′, put the points in C′ ∩ X_1 into W. In both of the above cases, the first three desired properties are satisfied for C^{Y ∖ W}. In the third case, for each cluster C, add the smaller part among C ∩ X_1 and C ∩ X_2 to W. In case |C ∩ X_1| = |C ∩ X_2|, we break the tie in a way such that properties (ii) and (iii) are satisfied. As C^Y contains at least two clusters, this can always be done. Moreover, property (i) is trivially satisfied.

In the above, we showed that for all the choices of cuts, it is possible to select W so that the first three properties are satisfied. Let W* be a minimum-sized set W over all cuts. As we select a cut for which the size of W is minimized, it is sufficient to show that |W*| ≤ OPT(Y).

Let k′ be the number of clusters in C^Y. Consider any optimal set W_O for Y such that C^{Y ∖ W_O} is explainable. Let (i, θ) be the canonical cut corresponding to the root of the threshold tree corresponding to the explainable clustering C^{Y ∖ W_O}. Such a cut exists, as C^Y contains at least two clusters. Let W be the set selected in our algorithm corresponding to the cut (i, θ). In the first of the above-mentioned three cases, suppose W_O does not contain the part C ∩ X_1 fully for any of the clusters C. In other words, Y ∖ W_O contains points from each such part C ∩ X_1. But then, even after choosing the root cut, we still need k′ − 1 more cuts to separate the points in X_1 ∖ W_O, which contains points from all the k′ clusters. However, by definition, the threshold tree must use only k′ − 1 cuts in total, and hence we reach a contradiction. Hence, C ∩ X_1 must be fully contained in W_O for some cluster C. In this case, our algorithm adds the points in C* ∩ X_1 to W for a cluster C* such that |C* ∩ X_1| is minimized over all clusters, and for any other cluster C′, we put the points in C′ ∩ X_2 into W. Thus, |W| ≤ OPT(Y). The proof for the second case is the same as the one for the first case. We discuss the proof for the third case. Consider the clusters C such that both C ∩ X_1 and C ∩ X_2 are non-empty. Note that these are the only clusters whose points are put into W. But then W_O must contain all the points from at least one of the parts C ∩ X_1 and C ∩ X_2. For each such cluster C, we add the smaller part among C ∩ X_1 and C ∩ X_2 to W. Hence, in this case also |W| ≤ OPT(Y). The lemma follows by noting that |W*| ≤ |W|. ∎
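
The cut selection in the proof above, together with the recursion of the algorithm, can be sketched as follows. This is a simplified illustration under our own data layout (clusters as a dict from labels to lists of point tuples), not the authors' implementation.

```python
def removal_set(cells, i, theta):
    """Build the removal set W for the cut (i, theta) by the case analysis of Lemma 1."""
    left = {c: [x for x in pts if x[i] <= theta] for c, pts in cells.items()}
    right = {c: [x for x in pts if x[i] > theta] for c, pts in cells.items()}
    # Each cluster survives on the side holding its larger part.
    side = {c: ('L' if len(left[c]) >= len(right[c]) else 'R') for c in cells}
    # Properties (ii)/(iii): each side must keep at least one whole cluster; if all
    # clusters fall on one side, flip the cluster losing the fewest points.
    if len(set(side.values())) == 1:
        s = next(iter(side.values()))
        flip = min(cells, key=lambda c: len(left[c] if s == 'L' else right[c]))
        side[flip] = 'R' if s == 'L' else 'L'
    W = [x for c in cells for x in (right[c] if side[c] == 'L' else left[c])]
    return W, side

def explain(cells):
    """Greedily pick cheapest cuts; return a list of removed points."""
    cells = {c: pts for c, pts in cells.items() if pts}
    if len(cells) <= 1:
        return []
    points = [x for pts in cells.values() for x in pts]
    best = None
    for i in range(len(points[0])):
        for theta in sorted({x[i] for x in points})[:-1]:  # canonical cuts
            W, side = removal_set(cells, i, theta)
            if best is None or len(W) < len(best[0]):
                best = (W, side, i, theta)
    if best is None:  # degenerate case: all remaining points coincide
        keep = max(cells, key=lambda c: len(cells[c]))
        return [x for c in cells if c != keep for x in cells[c]]
    W, side, i, theta = best
    left = {c: [x for x in cells[c] if x[i] <= theta] for c in cells if side[c] == 'L'}
    right = {c: [x for x in cells[c] if x[i] > theta] for c in cells if side[c] == 'R'}
    return W + explain(left) + explain(right)
```

For instance, on the 1-D input with clusters A = {0, 2} and B = {1, 3}, no single threshold separates A from B, and the sketch removes one point.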

### 3.2 Exact Algorithm

Our exact algorithm is based on a novel dynamic programming scheme. Here, we briefly describe the algorithm. Our first observation is that each subproblem can be defined w.r.t. a bounding box in ℝ^d, as each cut used to split a point set in any threshold tree is an axis-parallel hyperplane. The number of such distinct bounding boxes is at most n^{O(d)}, as in each dimension a box is specified by two bounding values. This explains the n^{O(d)} factor in the running time. Now, consider a fixed bounding box corresponding to a subproblem containing a number of the given clusters, possibly partially. If a new canonical cut splits a cluster, then one of the two resulting parts has to be removed, and this choice has to be passed on along the dynamic programming. As we remove at most J points and the number of clusters is at most k, the number of such distinct choices can be bounded by a function of k and J. This roughly gives us the following theorem.

###### Theorem 2.

Clustering Explanation can be solved in time f(k, J) · n^{O(d)} for a computable function f.

Before we move to the formal proof of the theorem, let us introduce some specific notation. Let a, b ∈ ℝ^d be such that a[i] < b[i] for all i ∈ {1, …, d}. We denote (a, b] = {x ∈ ℝ^d ∣ a[i] < x[i] ≤ b[i] for all i ∈ {1, …, d}} and call (a, b] an interval. For a collection of points X, we say that X is in (a, b] if X ⊆ (a, b], X is outside (a, b] if X ∩ (a, b] = ∅, and we say that (a, b] splits X if X ∩ (a, b] ≠ ∅ and X ∖ (a, b] ≠ ∅.

Let 𝒳 be a family of disjoint collections of points of ℝ^d. A subfamily 𝒮 ⊆ 𝒳 is said to be (a, b]-proper if (i) every X ∈ 𝒳 that is in (a, b] is in 𝒮, and (ii) every X ∈ 𝒳 that is outside (a, b] is not included in 𝒮. Note that the X ∈ 𝒳 that are split by (a, b] may be either in 𝒮 or not in 𝒮. The truncation of 𝒳 with respect to (a, b] is the family

 tr_{(a,b]}(𝒳) = {X ∩ (a, b] ∣ X ∈ 𝒳 s.t. X ∩ (a, b] ≠ ∅}.

For an integer J ≥ 0, (a, b] is J-feasible with respect to 𝒳 if (a, b] splits at most J collections in 𝒳.

###### Proof of Theorem 2.

Let (C, J) be an instance of Clustering Explanation, where C = {C_1, …, C_k} for disjoint collections C_1, …, C_k of points of ℝ^d. Let X = C_1 ∪ ⋯ ∪ C_k and n = |X|. Following the proof of Proposition 7, we say that a vector c ∈ ℝ^d is canonical if c[i] ∈ X[i] ∪ {min X[i] − 1} for every i ∈ {1, …, d}.

For every pair of canonical vectors a, b such that a[i] < b[i] for all i and (a, b] is J-feasible with respect to C, and every (a, b]-proper subfamily S ⊆ C, we denote by ω(a, b, S) the minimum size of a collection of points W, where W ⊆ (a, b], such that tr_{(a,b]}({C_i ∖ W ∣ C_i ∈ S}) is an explainable clustering. We assume that ω(a, b, S) = 0 if S is empty. We compute

 w(a,b,S)=ω(a,b,S)+∑Ci∈S|Ci∖(a,b]|+∑Ci∈C∖S|Ci∩(a,b]|. (2)

Since we are interested only in clusterings that can be obtained by deleting at most J points, we assume that w(a, b, S) = +∞ if this value is bigger than J. This slightly informal agreement simplifies the arguments. In particular, observe that the two sums in (2) give a value that is bigger than J if (a, b] is not J-feasible with respect to C. In fact, this is the reason why these sums are included in (2).

Notice that (C, J) is a yes-instance of Clustering Explanation if and only if w(a, b, C) ≤ J, where a[i] = min X[i] − 1 and b[i] = max X[i] for i ∈ {1, …, d}.

The values w(a, b, S) are computed depending on s = |S|. If s = 0, that is, S = ∅, then ω(a, b, S) = 0 and w(a, b, S) = ∑_{C_i ∈ C} |C_i ∩ (a, b]|. If s = 1, then ω(a, b, S) = 0 by definition. Then S = {C_i} for some C_i ∈ C such that C_i ∩ (a, b] ≠ ∅, and

 w(a,b,S)=|Ci∖(a,b]|+∑Cj∈C∖{Ci}|Cj∩(a,b]|.

Assume that s ≥ 2, and that the values of w(a′, b′, S′) are already computed for all subproblems with smaller intervals (a′, b′] or with |S′| < s.

For i ∈ {1, …, d} and θ ∈ ℝ such that a[i] < θ < b[i], we define the vectors a^{i,θ} and b^{i,θ} by setting

 a^{i,θ}[j] = θ if j = i, and a[j] if j ≠ i;  b^{i,θ}[j] = θ if j = i, and b[j] if j ≠ i.

We also say that (i, θ) is J-feasible if (a, b^{i,θ}] and (a^{i,θ}, b] are J-feasible with respect to C. For a J-feasible (i, θ), a partition (S_1, S_2) of S is (i, θ)-proper if S_1 and S_2 are (a, b^{i,θ}]-proper and (a^{i,θ}, b]-proper, respectively. We define

 δi,θ(S)=∑Ci∈S:Ci∩(a,bi,θ]≠∅and Ci∩(ai,θ,b]≠∅|Ci∩(a,b]|.

We compute w(a, b, S) by the following recurrence:

 w(a, b, S) = min{(∗∗), (∗∗∗)},  (3)

where the right part of (3) is denoted by (∗),

 (∗∗) = min{ |C_i ∖ (a, b]| + ∑_{C_j ∈ S ∖ {C_i}} |C_j| + ∑_{C_j ∈ C ∖ S} |C_j ∩ (a, b]| ∣ C_i ∈ S },

and

 (∗∗∗) = min{ w(a, b^{i,θ}, S_1) + w(a^{i,θ}, b, S_2) − δ_{i,θ}(S) ∣ (i, θ) is J-feasible with a[i] < θ < b[i], and (S_1, S_2) is an (i, θ)-proper partition of S }.

We assume that (∗∗∗) = +∞ if there is no triple satisfying the conditions in the definition of the set. We also assume that w(a, b, S) = +∞ if its value proves to be bigger than J.

The correctness of (3) is proved by showing the inequalities between the left and right parts in both directions.

First, we show that (∗) ≤ w(a, b, S). This is trivial if w(a, b, S) = +∞. Assume that this is not the case. Then, by our assumption, w(a, b, S) ≤ J. Recall that s = |S| ≥ 2. Let S = {C_{i_1}, …, C_{i_s}}, and let W be a collection of points with W ⊆ (a, b] such that tr_{(a,b]}({C_{i_j} ∖ W ∣ 1 ≤ j ≤ s}) is an explainable clustering. Assume that W is of minimum size, that is, |W| = ω(a, b, S). Let C′_{i_j} = C_{i_j} ∖ W for j ∈ {1, …, s}.

Notice that it may happen that C′_{i_j} ∩ (a, b] = ∅ for some j ∈ {1, …, s}. Then C_{i_j} ∩ (a, b] ⊆ W. Observe, however, that C′_{i_j} ∩ (a, b] = ∅ for at most J values of j, since |W| ≤ J. Suppose that there is h ∈ {1, …, s} such that C′_{i_h} ∩ (a, b] ≠ ∅ and C′_{i_j} ∩ (a, b] = ∅ for every j such that j ≠ h. In this case, we obtain that

 ω(a,b,S)=∑Cj∈S∖{Cih}|Cj∩(a,b]|

and

 w(a,b,S)=|Cih∖(a,b]|+∑Cj∈S∖Cih|Cj|+∑Cj∈C∖S|Cj∩(a,b]|.

Then (∗) ≤ (∗∗) ≤ w(a, b, S). Assume from now on that this is not the case, that is, C′_{i_j} ∩ (a, b] ≠ ∅ for at least two distinct indices j. Then we show that (∗∗∗) ≤ w(a, b, S).

Because we separate at least two nonempty collections of points, the definition of explainable clustering implies that there are i ∈ {1, …, d} and θ ∈ ℝ such that a[i] < θ < b[i], and there is a partition (P_1, P_2) of {C′_{i_1}, …, C′_{i_s}} with the property that (i) the truncation with respect to (a, b^{i,θ}] of P_1 is an explainable clustering with the clusters in P_1, and (ii) the truncation with respect to (a^{i,θ}, b] of P_2 is an explainable clustering with the clusters in P_2, where both P_1 and P_2 contain nonempty collections of points. Moreover, we assume that if C′_{i_j} ∩ (a, b] = ∅ for some j, then C′_{i_j} is placed in P_1 if C_{i_j} has a point in (a, b^{i,θ}] and, otherwise, i.e. if C_{i_j} has only points in (a^{i,θ}, b], it is placed in P_2.

We define B_1 = (a, b^{i,θ}] and B_2 = (a^{i,θ}, b]. For j ∈ {1, …, s}, let C_{i_j} be placed in S_1 if C′_{i_j} ∈ P_1, and in S_2 if C′_{i_j} ∈ P_2. We set R_1 = ⋃_{C_j ∈ S_1} (C_j ∩ B_2) and R_2 = ⋃_{C_j ∈ S_2} (C_j ∩ B_1). Let also r_1 = |R_1| and r_2 = |R_2|. Observe that (S_1, S_2) is a partition of S where some sets may be empty, and that R_1 ∪ R_2 ⊆ W, because every surviving cluster lies entirely on one side of the cut. Denote by W_1 = (W ∩ B_1) ∖ R_2 and W_2 = (W ∩ B_2) ∖ R_1, and let w_1 = |W_1| and w_2 = |W_2|. Clearly, |W| = w_1 + w_2 + r_1 + r_2.

Notice that ω(a, b^{i,θ}, S_1) ≤ w_1 and ω(a^{i,θ}, b, S_2) ≤ w_2. Also, S_1 and S_2 are (a, b^{i,θ}]-proper and (a^{i,θ}, b]-proper, respectively. Furthermore, for h ∈ {1, 2}, an explainable clustering of the truncation of S_h with respect to B_h is obtained by deleting the points of W_h from the clusters. Also, we have that

 ∑Cj∈S1|Cj∖(a,bi,θ]|=∑Cj∈S1|Cj∖(a,b]|+|R1|, (4)
 ∑Cj∈S2|Cj∖(ai,θ,b]|=∑Cj∈S2|Cj∖(a,b]|+|R2|, (5)

and

 ∑Cj∈C∖S1|Cj∩(a,bi,θ]|+∑Cj∈C∖S2|Cj∩(ai,θ,b]|=∑Cj∈C∖S|Cj∩(a,b]|+|R1|+|R2|. (6)

Note also that

 δi,θ(S)=|R1|+|R2|. (7)

Then by (4)–(7),

 w(a,b,S)= (w1+w2+r1+r2)+∑Cj∈S|Cj∖(a,b]|+∑Cj∈C∖S|Cj∩(a,b]| = (w1+w2+r1+r2)+∑Cj∈S1|Cj∖(a,b]|+∑Cj∈S2|Cj∖(a,b]| +∑Cj∈C∖S1|Cj∩(a,bi,θ]|+∑Cj∈C∖S2|Cj∩(ai,θ,b]|−2r1−2r2 = w1+∑Cj∈S1|Cj∖(a,bi,θ]|+∑Cj∈C∖S1|C