Interpretable_Clustering
Python Codebase for Interpretable Clustering, published at AIES'20
Graph clustering groups entities – the vertices of a graph – based on their similarity, typically using a complex distance function over a large number of features. Successful integration of clustering approaches in automated decision-support systems hinges on the interpretability of the resulting clusters. This paper addresses the problem of generating interpretable clusters, given features of interest that signify interpretability to an end-user, by optimizing interpretability in addition to common clustering objectives. We propose a β-interpretable clustering algorithm that ensures that at least β fraction of nodes in each cluster share the same feature value. The tunable parameter β is user-specified. We also present a more efficient algorithm for scenarios with β=1 and analyze the theoretical guarantees of the two algorithms. Finally, we empirically demonstrate the benefits of our approaches in generating interpretable clusters using four real-world datasets. The interpretability of the clusters is complemented by generating simple explanations denoting the feature values of the nodes in the clusters, using frequent pattern mining.
Graph clustering is increasingly used as an integral part of automated decision support systems for high-stakes applications such as infrastructure development (Hospers et al., 2009), criminal justice (Aljrees et al., 2016), and health care (Haraty et al., 2015). Such domains are characterized by high-dimensional data, and the goal of clustering is to group the nodes, typically based on similarity over all the features (Jain et al., 1999). The solution quality of the resulting clusters is measured by the objective value. As the number of features increases, it becomes increasingly difficult for an end-user to interpret the resulting clusters.
For example, consider the problem of clustering districts in Kenya to aid decision-making for infrastructure development (Figure 1), sanitation in particular (ICT, 2017; Authority, 2017). Each district is described by features denoting the population, access to basic sanitation, gender and age demographics, and location. The districts in a cluster are typically considered to be indistinguishable and hence may be assigned the same development policies. The similarity of districts for clustering is measured based on all the features. As a result, the cluster composition is likely to be heterogeneous with respect to the sanitation feature (Figure 1(a)). This may significantly affect the decision-maker's ability to infer meaningful patterns, especially given the lack of ground truth, thereby affecting their policy decisions.
Recently, there has been growing interest in interpretable machine learning models (Doshi-Velez and Kim, 2017; Lakkaraju et al., 2019; Rudin, 2019), mostly focusing on explainable predictive models or interpretable neural networks. There is limited prior research, if any, on improving the interpretability of clusters (Bertsimas et al., 2018; Chen et al., 2016). Clustering results are expected to be inherently interpretable, since the aim of clustering is to group similar nodes together. However, when clustering with a large number of features, interpretability may be diminished because an end-user may not be able to recognize any clear pattern, as in Figure 1(a).
Interpretability of the clusters is critical in high-impact domains since decision makers need to understand the solution beyond how the data is grouped into clusters: what characterizes a cluster and how it differs from other clusters. Additionally, the ability of a decision maker to evaluate the system for fairness and identify when to trust the system hinges on the interpretability of the results. In this work, the interpretability of clusters is measured by the homogeneity of the nodes in each cluster with respect to certain predefined feature values of interest (FoI) to the end-user.
Solution quality of the clusters, denoted by the objective value, and interpretability are often competing objectives. For example, in Figure 1(b), interpretability is optimized in isolation by partitioning the nodes based only on FoI, which significantly degrades the solution quality. Conversely, optimizing only for solution quality degrades interpretability (Figure 1(a)). Reliable decision support requires interpretable clusters, without significantly compromising the solution quality.
In this paper, we study the problem of optimizing for interpretability of clusters, in addition to optimizing the solution quality of centroid-based clustering algorithms such as k-center. We propose a β-interpretable clustering algorithm that generates clusters such that at least β fraction of nodes in each cluster share the same feature value, with respect to FoI. The value β is a user-specified input. By adjusting the value of β, the homogeneity of the nodes in the cluster with respect to FoI can be altered, thus facilitating balancing the trade-off between solution quality and interpretability (Figure 1(c)). We then present a more efficient algorithm to specifically handle settings with β = 1 and bound the loss in solution quality of centroid-based clustering objectives, when optimizing for interpretability.
While interpretable clusters are a minimal requirement, they may not be sufficient to guarantee interpretability of the system, due to the cognitive overload for users in understanding the results. Hence, the resulting clusters are complemented by logical combinations of cluster labels as explanations. The feature values of the nodes in the cluster, with respect to FoI, are generated as cluster labels, using frequent pattern mining. In Figure 1(d), traditional clustering produces longer explanations, which are generally undesirable (Doshi-Velez and Kim, 2017), whereas optimizing for interpretability produces concise explanations. Thus, generating interpretable clusters is crucial for generating concise and useful explanations.
Our primary contributions are: (i) formalizing the problem of interpretable clustering that optimizes for interpretability, in addition to solution quality (Section 2); (ii) presenting two algorithms to achieve interpretable clustering and analyzing their theoretical guarantees (Section 3); and (iii) empirical evaluation of our approaches using four real-world datasets and using frequent pattern mining to generate cluster explanations (Section 4). Our experiments demonstrate the efficiency of our approaches in balancing the trade-off between interpretability and solution quality. The results also show that clusters with different levels of interpretability can be generated by varying β.
Let $V$ denote a set of nodes, along with a pairwise distance metric $d : V \times V \to \mathbb{R}$. Let $F = \{F_1, \dots, F_m\}$ denote the set of values of $m$ features, where $F_i$ refers to the set of values for the i-th feature, and let $A$ denote the mapping from nodes to the feature values. Let $G = (V, d)$ be a graph where $d$ is a metric over $V$. Given a graph instance $G$ and an integer $k$, the goal is to partition $V$ into $k$ disjoint subsets by optimizing an objective function, which results in clusters $\mathcal{C} = \{C_1, \dots, C_k\}$. The objective function ($o$), for a graph $G$ and a set of clusters $\mathcal{C}$, returns an objective value as a real number, $o(G, \mathcal{C}) \in \mathbb{R}$, which helps compare different clustering techniques. The optimal objective value of an objective function $o$ is denoted by $OPT_o$. $C(v)$ denotes the cluster to which the node $v$ is assigned.
The clusters produced by the existing algorithms are often non-trivial and non-intuitive to understand for an end-user due to the complex feature space. Let $F_I$ denote the set of features in $F$ that signify interpretability for the user, denoting the feature values of interest (FoI). In Figure 1, $F_I$ is the sanitation feature with values {0-25%, 25-50%, 50-75%, 75-100%}, denoting the four feature values of access to basic sanitation.
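Quantifying Interpretability: The interpretability score of a cluster $C$ with respect to a feature value $f$ is denoted by $I_f(C)$ and estimated as the fraction of the nodes in the cluster that share the feature value:
$$I_f(C) = \frac{1}{|C|} \sum_{v \in C} \mathbb{1}[v \text{ satisfies } f],$$
with $\mathbb{1}[v \text{ satisfies } f]$ denoting whether the node $v$ satisfies feature value $f$ and $|C|$ denoting the total number of nodes in the cluster. Hence, $I_f(C) \in [0, 1]$. Given $F_I$, the interpretability score of a cluster, $I(C)$, is calculated as
$$I(C) = \max_{f \in F_I} I_f(C).$$
The interpretability score of a clustering $\mathcal{C}$, given $F_I$, is denoted by $I(\mathcal{C})$ and is calculated as:
$$I(\mathcal{C}) = \min_{C \in \mathcal{C}} I(C).$$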
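The scores defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes a single feature of interest, that a cluster's score is the share of its most common feature value, and that a clustering's score is the lowest score among its clusters (which matches the later requirement that every cluster reach the β threshold). The names `cluster_score`, `clustering_score`, and `feature_of` are hypothetical.

```python
from collections import Counter

def cluster_score(cluster, feature_of):
    """Fraction of nodes in `cluster` that share its most common FoI value."""
    if not cluster:
        return 0.0
    counts = Counter(feature_of[v] for v in cluster)
    return max(counts.values()) / len(cluster)

def clustering_score(clusters, feature_of):
    """Assumed aggregate: the lowest score among the non-empty clusters."""
    return min(cluster_score(c, feature_of) for c in clusters if c)

# Toy data: node -> value of the feature of interest (hypothetical values).
feature_of = {"a": "0-25%", "b": "0-25%", "c": "25-50%",
              "d": "75-100%", "e": "75-100%"}
clusters = [["a", "b", "c"], ["d", "e"]]

scores = [cluster_score(c, feature_of) for c in clusters]
overall = clustering_score(clusters, feature_of)
print(scores, overall)
```

Here the first cluster has two of three nodes sharing a value and the second is fully homogeneous, so the clustering as a whole is limited by the first cluster.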
Problem Statement: Given a graph $G$, an integer $k$, and the features of interest $F_I$, we aim to create clusters that maximize the interpretability score, $I(\mathcal{C})$, while simultaneously optimizing for solution quality using centroid-based clustering objectives such as k-center. k-center clustering aims to identify k nodes as centers (say $c_1, \dots, c_k$, with $c_i \in V$) and assign each node to the closest cluster center, ensuring that the maximum distance of any node from its cluster center is minimized. The objective value is calculated as:
$$o_{kC}(G, \mathcal{C}) = \max_{v \in V} \min_{1 \le i \le k} d(v, c_i).$$
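For reference, the k-center objective and the standard greedy (farthest-first) heuristic can be sketched as follows. This is an illustrative sketch under stated assumptions: points are coordinate tuples with Euclidean distance, and the first center is simply the first point rather than a random one. The function names are hypothetical and are not taken from the paper's codebase.

```python
import math

def kcenter_objective(points, centers):
    """Largest distance from any point to its nearest center."""
    return max(min(math.dist(p, c) for c in centers) for p in points)

def greedy_kcenter(points, k):
    """Farthest-first traversal: start from the first point (an assumption),
    then repeatedly add the point farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(points,
                       key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(farthest)
    return centers

points = [(0, 0), (1, 0), (10, 0), (11, 0), (5, 8)]
centers = greedy_kcenter(points, 3)
objective = kcenter_objective(points, centers)
print(centers, objective)
```

Each node is implicitly assigned to its nearest center, so the objective is the radius of the widest cluster.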
A clustering $\mathcal{C}$ is β-interpretable, given $F_I$, if $I(\mathcal{C}) \ge \beta$. That is, each cluster is composed of at least β fraction of nodes that share the same feature value.
A clustering $\mathcal{C}$ is strongly interpretable, given $F_I$, if $I(\mathcal{C}) = 1$.
We now analyze the maximum achievable interpretability for a given dataset and identify the upper bound on β.
Let $\beta_{\max}$ denote the optimal upper bound of the interpretability score $I(\mathcal{C})$. Without loss of generality, given a feature value $f$, we assume that there exists at least one node that satisfies $f$. When the number of feature values of interest is at most $k$ ($|F_I| \le k$), with $k$ denoting the number of clusters, a clustering $\mathcal{C}$ can be generated such that $I(\mathcal{C}) = 1$. This is achieved by constructing each cluster with the nodes that satisfy the same feature value $f$, and hence $\beta_{\max} = 1$.
However, when there are more feature values of interest than clusters ($|F_I| > k$), there exists no clustering with $I(\mathcal{C}) = 1$, since the optimal solution cannot form clusters with nodes satisfying only one feature value of interest. Hence, in such cases, $\beta_{\max} < 1$. The optimal value for this case can be estimated as follows: consider the top-k feature values based on frequency of occurrence in the data and assign the nodes that carry each of these values to a different cluster. All the remaining unassigned nodes are then iteratively assigned to the cluster with the maximum interpretability score. If multiple clusters have the same interpretability score, the new node is added to the cluster with the larger size, since this is less likely to negatively affect the interpretability score.
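The estimation procedure described above can be sketched as follows. This is a rough illustration under assumptions, not the paper's implementation: there is a single feature of interest, each cluster is tracked only by its majority count and its size, and the function name `estimate_max_interpretability` is hypothetical.

```python
from collections import Counter

def estimate_max_interpretability(feature_of, k):
    """Seed one cluster per top-k feature value, then add each leftover node
    to the cluster with the highest score (ties go to the larger cluster)."""
    counts = Counter(feature_of.values())
    # Each cluster is [majority_count, size]; seeds are fully homogeneous.
    clusters = [[n, n] for _, n in counts.most_common(k)]
    leftover = sum(counts.values()) - sum(size for _, size in clusters)
    for _ in range(leftover):
        target = max(clusters, key=lambda c: (c[0] / c[1], c[1]))
        target[1] += 1  # a leftover node never matches the seed value
    return min(m / s for m, s in clusters)

# Three feature values but only two clusters: one node must be an outlier.
feature_of = {1: "A", 2: "A", 3: "A", 4: "A", 5: "B", 6: "B", 7: "B", 8: "C"}
estimate_k2 = estimate_max_interpretability(feature_of, 2)
estimate_k3 = estimate_max_interpretability(feature_of, 3)
print(estimate_k2, estimate_k3)
```

With two clusters, the single "C" node joins the larger "A" cluster, so the estimate is limited by that cluster; with three clusters every feature value gets its own cluster.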
In general, the interpretability score of a cluster $C$ is dominated by the feature value satisfied by the maximum number of nodes within $C$. For a given cluster $C$, interpretability can be boosted by either adding more nodes of the majority feature value or removing the nodes that differ from the majority feature value. If all the nodes that do not represent the majority feature value are removed, the interpretability score of $C$ is 1. Using this intuition, we propose algorithms to generate β-interpretable clusters, when .
In many applications, clusters that are considerably homogeneous but not strongly interpretable may still be acceptable, since a few outliers do not affect the decision maker's ability to infer a pattern. For example, if the nodes in a cluster are 90% homogeneous, the interpretability may not be significantly affected, while relaxing the requirement may help improve the solution quality of the clusters formed using centroid-based objectives. To that end, we propose an algorithm (Algorithm 1) in which the homogeneity of the nodes in a cluster can be adjusted using a tunable parameter β. The algorithm identifies interpretable clusters for all values of β. We present the algorithms using k-center as the clustering objective. However, it is straightforward to extend the algorithms to any other centroid-based clustering objective.
The input to Algorithm 1 is a graph G and the parameters k and β, referring to the number of clusters needed and the interpretability score requirement. First, it initializes a collection of clusters with the greedy k-center algorithm, which optimizes the quality of the clusters generated. To improve the interpretability score, our algorithm iteratively identifies a cluster C with the least interpretability score and then post-processes it to improve its interpretability score without considerable loss in the k-center objective. While processing C, a feature value associated with the maximum number of nodes in C is identified as the 'majority' feature value, along with the set of nodes in C that share the majority value. To boost the score of C, the fraction of nodes that share the majority feature value needs to be increased. We employ the following two operations for this purpose:
The total number of nodes with the majority feature value is increased (boost_majority); and
The nodes in C that do not correspond to the majority feature value are removed from C and re-assigned to other clusters (reduce_minority).
boost_majority. Outlined in Algorithm 2, this subroutine iterates over the clusters to identify the closest cluster C' that contains nodes with the 'majority' feature value and merges C' with C (Line 1,2). It then identifies the two feature values that have the maximum frequency within the merged cluster and assigns them to two different clusters, C_1 and C_2 (Line 2). The remaining nodes in the merged cluster are assigned to either of the two clusters such that C_1 and C_2 have comparable interpretability scores (Line 4,5).
reduce_minority. This subroutine, outlined in Algorithm 3, identifies the collection of nodes within C that do not have the 'majority' feature value and whose removal helps boost the interpretability score of C (Line 1). Nodes that do not have the majority feature value and are farthest from the cluster center are considered for re-assignment (Lines 2,3). Each such node is then assigned to a cluster C', with candidate clusters considered in increasing order of distance, such that the interpretability score of C' does not drop below β (Line 4). This process of removing nodes from C is performed only when C has the maximum number of nodes present in the dataset that share the majority feature value.
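A simplified reading of reduce_minority is sketched below. It is not the paper's Algorithm 3: it assumes Euclidean coordinates, measures distance from a node to each candidate cluster's center, requires the receiving cluster to keep a score of at least `beta`, and omits the precondition on when the subroutine is invoked. All names (`reduce_minority`, `score`, `pos`, `feature_of`) are hypothetical.

```python
import math
from collections import Counter

def score(nodes, feature_of):
    """Share of the most common feature value (empty clusters count as 1.0)."""
    if not nodes:
        return 1.0
    return max(Counter(feature_of[v] for v in nodes).values()) / len(nodes)

def reduce_minority(idx, clusters, centers, pos, feature_of, beta):
    """Move minority nodes out of clusters[idx] until its score reaches beta."""
    cluster = clusters[idx]
    majority = Counter(feature_of[v] for v in cluster).most_common(1)[0][0]
    # Minority nodes, farthest from this cluster's center first.
    minority = sorted((v for v in cluster if feature_of[v] != majority),
                      key=lambda v: math.dist(pos[v], centers[idx]),
                      reverse=True)
    for v in minority:
        if score(cluster, feature_of) >= beta:
            break
        # Try the other clusters, nearest center first.
        others = sorted((j for j in range(len(clusters)) if j != idx),
                        key=lambda j: math.dist(pos[v], centers[j]))
        for j in others:
            if score(clusters[j] + [v], feature_of) >= beta:
                cluster.remove(v)
                clusters[j].append(v)
                break
    return clusters

pos = {"a": (0, 0), "b": (1, 0), "c": (0, 1),
       "x": (4, 0), "y": (10, 0), "z": (9, 0)}
feature_of = {"a": "A", "b": "A", "c": "A", "x": "B", "y": "B", "z": "B"}
clusters = [["a", "b", "c", "x"], ["y", "z"]]
centers = [(0, 0), (10, 0)]
result = reduce_minority(0, clusters, centers, pos, feature_of, beta=1.0)
print(result)
```

In this toy instance the single outlier "x" is moved to the cluster that already holds its feature value, so both clusters end up homogeneous. The price is a larger k-center radius for the receiving cluster, which is the trade-off the text describes.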
In some cases, Algorithm 1 may converge to a local maximum and may not reach , when the input . This happens when the feature value being boosted is not one of the feature values in the optimal solution. However, we observe that this is a rare scenario in practice. A detailed algorithm that works in these cases is described in the Appendix.
For cases in which the minimum distance pair identified in Algorithm 2 belongs to the same optimal cluster, we bound the loss in the k-center objective when using boost_majority.
In each iteration of boost_majority where the minimum distance pair identified in Algorithm 2 belongs to the same optimal cluster, the k-center objective value worsens by , where and denotes the optimal k-center objective value of the clusters that achieve maximum interpretability.
Proof in Appendix.
When generating clusters with β = 1, Algorithm 1 may take long to converge, especially if the initial k-center based clusters have poor interpretability. We propose a more efficient algorithm for strong interpretability that solves the interpretable clustering problem on each individual feature to construct the final solution.
Algorithm 4 is a more efficient approach to handle scenarios with β = 1. At a high level, it identifies the distribution of feature values among clusters and then quickly generates the clusters. It leverages the property that a clustering with an interpretability score of 1 is characterized by clusters such that all nodes in a cluster share the same feature value. As discussed earlier, an interpretability score of 1 is achievable only when the number of feature values of interest is at most k, and this is an important assumption required for this algorithm.
The first step is to identify a set S of tuples of values (k_1, ..., k_m), one value per feature value of interest, that sum up to k (Line 2). This set identifies all possible distributions of the different feature values under consideration for interpretability (FoI) among the clusters. For each value k_i in a tuple, it identifies k_i clusters for the nodes with feature value f_i. The collection of these clusters forms the solution corresponding to that tuple (Lines 3-5). This step generates |S| collections of k clusters, and the one with the minimum k-center objective value is chosen as the final set of clusters (Line 6).
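The enumeration described above can be sketched as follows. This is a hedged illustration rather than the paper's Algorithm 4: it assumes every feature value receives at least one cluster (so the tuples are positive integers), uses Euclidean points, and clusters each feature group with a deterministic farthest-first heuristic. The names `compositions`, `greedy_centers`, `radius`, and `strongly_interpretable_kcenter` are hypothetical.

```python
import math

def compositions(total, parts):
    """Yield tuples of `parts` positive integers that sum to `total`."""
    if parts == 1:
        if total >= 1:
            yield (total,)
        return
    for first in range(1, total - parts + 2):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def greedy_centers(points, k):
    """Farthest-first k-center heuristic, seeded with the first point."""
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        centers.append(max(
            points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def radius(points, centers):
    """k-center objective restricted to one feature group."""
    return max(min(math.dist(p, c) for c in centers) for p in points)

def strongly_interpretable_kcenter(groups, k):
    """`groups` maps each FoI value to its points. Returns the best
    (cost, clusters-per-value, centers) or None if k is too small."""
    values = list(groups)
    best = None
    for split in compositions(k, len(values)):
        centers = {f: greedy_centers(groups[f], n)
                   for f, n in zip(values, split)}
        cost = max(radius(groups[f], centers[f]) for f in values)
        if best is None or cost < best[0]:
            best = (cost, dict(zip(values, split)), centers)
    return best

groups = {"low": [(0, 0), (1, 0), (10, 0)], "high": [(0, 5), (1, 5)]}
cost, split, chosen_centers = strongly_interpretable_kcenter(groups, 3)
too_few = strongly_interpretable_kcenter(groups, 1)
print(cost, split, too_few)
```

Because every cluster is built from nodes of a single feature value, any returned clustering is homogeneous by construction. The number of tuples grows combinatorially with k and the number of feature values, which is the cost of exhaustive enumeration.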
Algorithm 4 is capable of generating clusters with high interpretability, without significant loss in the clustering objective value. We now show that the final solution returned by our algorithm is a 2-approximation of the optimal algorithm that generates interpretable clusters and optimizes for the k-center objective.
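The strong-interpretability clustering algorithm generates a clustering $\mathcal{C}$ such that $I(\mathcal{C}) = 1$ and $o_{kC}(\mathcal{C}) \le 2 \cdot OPT_{IC}$, where $o_{kC}(\mathcal{C})$ refers to the k-center objective of $\mathcal{C}$ and $OPT_{IC}$ denotes the optimal k-center objective value of clusters that achieve maximum interpretability.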
Since all nodes in each cluster share the same feature value, $I(\mathcal{C}) = 1$. Additionally, the optimal solution has some distribution of feature values among its clusters, and this distribution is one of the tuples in the enumerated set. The solution constructed for that tuple is a 2-approximation of the optimal solution (following the proof of 2-approximation of the greedy algorithm for k-center). Since the final solution is the one that minimizes the k-center objective over all the clusterings generated from the enumerated set, it is guaranteed to be a 2-approximation of the optimal solution. ∎
We evaluate the efficiency of our approaches based on two metrics: the interpretability score of the clustering and the objective value of the k-center algorithm. We refer to Algorithm 1 as β-IC and Algorithm 4 as IKC.
Baselines The results are compared with those of three baselines: 1. k-center clustering over all the features in the data (KC); 2. partitioning the dataset into k clusters based on the FoI; and 3. k-center clustering over only the features of interest for interpretability. The first two baselines represent extremes of the spectrum, optimizing only for the k-center objective or only for interpretability. The third baseline aims to optimize for the distances, ensuring that the nodes with similar features are present close to each other.
Datasets The algorithms are evaluated using four datasets: 1. Kenya sanitation data, in which interpretability is defined over the percentage of the population in a district with access to basic sanitation; 2. Kenya traffic accidents data (https://www.opendata.go.ke/datasets/2011-traffic-incidences-from-desinventar), whose interpretability is measured based on the accident type; 3. Adult dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/adult/), in which the interpretability of the clusters is defined based on the age and income of the population; and 4. Crime data (http://archive.ics.uci.edu/ml/datasets/communities+and+crime), with FoI as the median family income of the communities.
Setup All algorithms are implemented in Python and tested on an Intel i7 computer with 8GB of RAM. In the interest of clarity, we experiment with for all domains. Due to randomness in the k-center algorithm, the clustering objective behavior of our techniques may not be monotonic. For any given , we run the algorithm for different values and choose the best clustering returned.
We first study the trade-off between the k-center objective value and the interpretability score of the clusters. We vary β for Algorithm 1 and compare the results with that of the baselines and that of Algorithm 4, with fixed . The results in Figure 2 show how the k-center objective value may be affected as we form increasingly interpretable clusters using our algorithms. We do not distinguish between the performances with various β values, denoted by the purple markers, since the goal is to understand how the algorithm balances the trade-off for any β value. We also do not consider since that defeats the purpose of optimizing for interpretability. Note that our algorithm supports any value of β as input.
Approaches that minimize the k-center objective and maximize interpretability, in the lower right corner of the figure, are desirable. Overall, the baselines either achieve high interpretability with a poor k-center objective or a low k-center objective with very low interpretability. Our approach strikes a better balance between them, since the clusters generated by our algorithm have high interpretability without significant loss in the k-center objective. With the increase in β values, the k-center objective worsens, but the loss is not high and is within a factor of 5 in most cases. The runtime of is at most 40 seconds across all datasets and the runtime of our approach is at most 65 seconds across all datasets and all values of β. This shows that there is no significant overhead in optimizing for both interpretability and solution quality.
As discussed above, our approaches efficiently balance the trade-off even for higher values of β. We now study the effects of varying β on the cluster composition. Figure 3 shows the distribution of FoI in each cluster for different values of β for the Adult dataset. In the interest of readability, we do not include results for lower values of β. With the increase in β, the fraction of the majority feature value in each cluster grows. For example, the nodes represented by yellow color are merged as β is increased from to . Similarly, when β is increased from to , the green colored feature value is a minority, and its nodes are merged to form a new cluster. Notice that not all the green feature nodes are merged; this process stops as soon as the clusters reach interpretability of β. However, in the case of strong interpretability with β = 1, the clusters are homogeneous. In our experiments, the runtime with is at most twice as that of and the runtime of is consistently lower than β-IC for . Similar trends were observed for other domains and the results are included in the Appendix.
To ensure that the trend in the relative performance of the approaches in minimizing the k-center objective is consistent, we experiment with varying the number of clusters k, and with fixed . Figure 4 plots the results of the approaches for k varying from 10 to 50. As expected, the k-center objective value decreases with the increase in k and the relative behavior of all the techniques is consistent. Our techniques β-IC and IKC are close to KC across all datasets, while works well for accidents and sanitation datasets only. All techniques run in less than three hours for all datasets and for all values of k in the experiments, demonstrating the scalability of our approach.
The interpretability of the resulting clusters can be further improved by generating explanations based on the feature values of the nodes in the clusters. Concise and correct explanations based on FoI are possible only when the clusters are homogeneous with respect to FoI. Hence, generating explanations also allows us to understand and compare the performance of different techniques beyond the interpretability scores.
We generate explanations as logical combinations of the feature values of FoI associated with the nodes in each cluster, using frequent pattern mining (Han et al., 2007). This is implemented using the Python pyfpgrowth package with a minimum support value of of the cluster size. That is, this approach lists all the feature values that are associated with at least of the nodes in the cluster and is the tolerance for outliers in the cluster. This value can be adjusted depending on the application. Explanations are then generated by a logical OR over these feature values. Figure 5 shows the distribution of explanations across clusters for different techniques on the Adult data, with . Clusters generated by the approach contain a skewed distribution of features across all clusters and are hard to interpret, with respect to FoI. Approaches that focus on interpretability generate homogeneous clusters, with the majority of the nodes in a cluster sharing the same feature value. As a result, the generated explanations for these approaches are concise and fairly different across the clusters, thereby improving the interpretability for the decision maker.
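The explanation step can be illustrated with a small sketch. To stay self-contained it does not call pyfpgrowth; instead it assumes a single feature of interest, in which case the frequent patterns reduce to the feature values whose count meets the support threshold. The function name `explain` and the `min_support` value used below are hypothetical, since the threshold used in the experiments is not recoverable from the text.

```python
from collections import Counter

def explain(cluster, feature_of, min_support):
    """Join with OR every FoI value held by at least `min_support`
    (a fraction of the cluster size) of the nodes in the cluster."""
    counts = Counter(feature_of[v] for v in cluster)
    threshold = min_support * len(cluster)
    frequent = [value for value, n in counts.most_common() if n >= threshold]
    return " OR ".join(frequent)

# Toy clusters: one nearly homogeneous, one evenly mixed.
feature_of = {i: "young" for i in range(8)}
feature_of.update({8: "senior", 9: "senior"})
homogeneous = list(range(10))          # 8 young, 2 senior
mixed = [0, 1, 2, 8, 9, 10]
feature_of[10] = "senior"              # mixed has 3 young, 3 senior

short_explanation = explain(homogeneous, feature_of, min_support=0.25)
long_explanation = explain(mixed, feature_of, min_support=0.25)
print(short_explanation)
print(long_explanation)
```

The nearly homogeneous cluster yields a one-term explanation because its outliers fall below the support threshold, while the mixed cluster needs a disjunction, which mirrors the observation that homogeneous clusters lead to more concise explanations.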
Interpretable machine learning The two main threads of research in interpretable machine learning are generating explanations for black-box models (Abdul et al., 2018; Holzinger, 2018; Gunning, 2017; Guidotti et al., 2019; Lakkaraju et al., 2019) and improving the transparency with interpretable models (Doshi-Velez and Kim, 2017; Rudin, 2019; Chen et al., 2018). Most of these approaches have been developed for predictive models or for interpretable neural networks and have heavily relied on domain-dependent notions of interpretability (Doshi-Velez and Kim, 2017). We define a domain-independent notion of interpretability and aim to form interpretable clusters, which is critical for high-impact applications (Rudin, 2019). We argue that generating explanations for clustering requires homogeneous clusters and propose algorithms that improve the interpretability without compromising on the solution quality.
Clustering with multiple objectives Prior research on clustering focuses heavily on improving the performance metrics (Aggarwal and Wang, 2010; Jain et al., 1999; Xu and Wunsch, 2005), such as accuracy, scalability and runtime, but neglects the interpretability aspect. Another thread of work employs soft clustering methods (Chen et al., 2016; Greene and Cunningham, 2005) or mixed integer optimization (Bertsimas et al., 2018) to improve interpretability but does not provide any solution guarantees. Constrained clustering (Wagstaff et al., 2001), in which the pairs of nodes that must belong to the same cluster are enforced as constraints, cannot be used to generate interpretable clusters when . Another related body of work is the research on multi-objective clustering (Chierichetti et al., 2017; Law et al., 2004; Jiamthapthaksin et al., 2009; Handl and Knowles, 2007; Bera et al., 2019), which has been predominantly applied to specific applications and, recently, to improving fairness. Extending these approaches to our setting is not straightforward since the algorithms are problem-specific. There is limited research on interpretable clustering (Chen et al., 2016) since clusters are expected to be interpretable as they group similar nodes, which is not necessarily the case when dealing with high-dimensional data.
We address the challenge of generating interpretable clusters, while simultaneously optimizing for solution quality of the resulting clusters. We propose an algorithm to generate β-interpretable clusters, given β and the features of interest that signify interpretability to the user. A more efficient algorithm specifically to handle scenarios with β = 1 is also presented, along with the theoretical guarantees of the two approaches. Our approaches efficiently balance the trade-off between interpretability and solution quality, compared to the baselines. The proposed approach can be extended to handle continuous FoI by treating each interval of continuous values as a discrete value for β-interpretability.
We currently target settings in which clustering is performed using centroid-based algorithms. In the future, we aim to expand the range of clustering objectives considered, including hierarchical clustering, and analyze their theoretical guarantees. Using interpretable clustering to identify bias in decision-making is another interesting direction for future research.
Criminal pattern identification based on modified k-means clustering. In IEEE International Conference on Machine Learning and Cybernetics (ICMLC), Vol. 2, pp. 799–806.
This looks like that: deep learning for interpretable image recognition. arXiv preprint arXiv:1806.10574.
Explainable artificial intelligence. Defense Advanced Research Projects Agency (DARPA).
IEEE Transactions on Evolutionary Computation 11 (1), pp. 56–76.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
In a particular iteration, let the clusters identified to be merged in boost_majority be and with such that and be the edge with minimum distance. Since are present in the same optimal cluster, then , where is the k-center objective for the optimal clustering that optimizes for IC and k-center. Hence . The final clusters constructed have nodes from and . The maximum pairwise distance between any pair of points such that and can be evaluated using triangle inequality. We use the property that the maximum distance between any pair of points within same cluster of radius is .
This shows that the pairwise distance between any pair of points within is less than . Hence, the final cluster output by boost_majority will be 10-approximation of the optimal solution in the worst case. ∎
As discussed in the main paper, Algorithm 1 (in the paper) may converge at a local maxima and never achieve . Even though this is a rare scenario, we present Algorithm 5 as a different subroutine which can be run along with boost_majority and reduce_minority subroutines to modify the clusters and identify clustering with higher interpretability. This subroutine first identifies a feature which is present as a majority feature in one of the optimal clusters but is not present as a majority in any of the clusters (Line 2). Since this feature needs to be a majority, we identify a cluster that is closest to and contains nodes of feature . All nodes with feature in are added to and then the feature nodes are boosted to become majority by calling boost_majority subroutine ensuring that it boosts nodes that belong to feature .