
Interpretable & Explorable Approximations of Black Box Models

We propose Black Box Explanations through Transparent Approximations (BETA), a novel model agnostic framework for explaining the behavior of any black-box classifier by simultaneously optimizing for fidelity to the original model and interpretability of the explanation. To this end, we develop a novel objective function which allows us to learn (with optimality guarantees), a small number of compact decision sets each of which explains the behavior of the black box model in unambiguous, well-defined regions of feature space. Furthermore, our framework also is capable of accepting user input when generating these approximations, thus allowing users to interactively explore how the black-box model behaves in different subspaces that are of interest to the user. To the best of our knowledge, this is the first approach which can produce global explanations of the behavior of any given black box model through joint optimization of unambiguity, fidelity, and interpretability, while also allowing users to explore model behavior based on their preferences. Experimental evaluation with real-world datasets and user studies demonstrates that our approach can generate highly compact, easy-to-understand, yet accurate approximations of various kinds of predictive models compared to state-of-the-art baselines.



1. Introduction

The successful adoption of predictive models in settings such as criminal justice and health care hinges on how much judges and doctors can understand and trust the functionality of these machine learning models. Only if decision makers have a clear understanding of the behavior of predictive models can they evaluate when and how much to depend on these models, detect potential biases in them, and develop strategies for further model refinement. However, the increasing complexity of predictive models is making it harder to explain or reason about their behavior [8], emphasizing the need for tools which can explain the complex behavior of predictive models in a faithful and interpretable manner.

Prior research on interpretable machine learning mainly focused on learning predictive models from scratch which were human understandable. Examples of such models include decision trees [9], decision lists [6], decision sets [4], linear models, and generalized additive models [7]. More recently, Ribeiro et al. [8] and Koh and Liang [3] proposed approaches to explain individual predictions of any black box classifier. Ribeiro et al. [8] proposed an approach which explains individual predictions of any classifier by generating locally interpretable models. They then approximate the global behavior of the classifier by choosing certain representative instances and their corresponding locally interpretable models. This approach, however, does not clearly specify which of the multiple locally interpretable models are applicable to which part of the feature space.

If Age 50 and Male Yes:
            If Past-Depression Yes and Insomnia No and Melancholy No, then Healthy
            If Past-Depression Yes and Insomnia Yes and Melancholy Yes and Tiredness Yes, then Depression
If Age 50 and Male No:
            If Family-Depression Yes and Insomnia No and Melancholy Yes and Tiredness Yes, then Depression
            If Family-Depression No and Insomnia No and Melancholy No and Tiredness No, then Healthy
            If Past-Depression Yes and Tiredness No and Exercise No and Insomnia Yes, then Depression
            If Past-Depression No and Weight-Gain Yes and Tiredness Yes and Melancholy Yes, then Depression
            If Family-Depression Yes and Insomnia Yes and Melancholy Yes and Tiredness Yes, then Depression
Figure 1. Explanations generated by our approach on the depression dataset when approximating a deep neural network.

Here, we study the problem of constructing global explanations of black box classifiers. Our goal is to explain the behavior of any given black-box classifier as a whole (i.e., globally) instead of just reasoning about its individual predictions. To this end, we propose a framework, BETA, which constructs a small number of compact decision sets (sets of if-then rules), each of which captures the behavior of the given black box model in certain parts of the feature space (see Figure 1). To ensure that the resulting explanations are faithful to the original model, we choose approximations based on how well they mimic the original model in terms of assigning class labels to instances. Our framework also unambiguously specifies the rationale used for assigning labels to instances in any part of the feature space by ensuring that each decision set and the corresponding decision rules explain non-overlapping parts of the feature space. To ensure that the resulting explanations are interpretable, we not only employ an intuitive rule based representation but also focus on minimizing its complexity in terms of the number of rules, predicates, etc. Our framework also allows users to explore how the original model behaves in subspaces characterized by different values of the features that are of interest to the user.

To address the problem at hand, we propose a novel optimization problem which incorporates all the aforementioned aspects. While exactly optimizing our objective is an NP-hard problem, it has a specific structure which allows for provably near-optimal solutions. In particular, we prove that our objective is a non-normal, non-monotone submodular function and that the constraints are matroids. We then employ an efficient optimization procedure based on approximate local search [5] which provides the best known approximation guarantee (1/5) for this class of problems. Experimental results on a real-world depression diagnosis dataset indicate that our approach can generate much less complex and higher fidelity approximations compared to state-of-the-art baselines. We also carried out user studies in which we asked human subjects to reason about a black box model's behavior using the approximations generated by our approach and other state-of-the-art baselines. Results of this study demonstrate that the approximations generated by our approach allow humans to accurately and quickly reason about the behavior of complex predictive models.

2. Our Framework

In this work, the goal of creating approximations which can meaningfully explain the behavior of any black box model is guided by the following properties:

Fidelity: The approximation should correctly capture the black box model behavior in all parts of the feature space. While different notions of fidelity can be defined, one possible way this can be achieved is through the labels assigned by the approximation matching the labels assigned by the black box model for most instances (ideally all instances) in the data.

Unambiguity: The approximation should provide a single, deterministic rationale for explaining the prediction of every instance in the data and consequently should unambiguously specify the rationale used for assigning labels to instances in any part of the feature space.

Interpretability: The approximation that we construct should be human-understandable. While choosing an interpretable representation (e.g., rule based models, linear models, decision trees/sets) is a minimal requirement, it is not sufficient to ensure interpretability. Cognitive limitations of humans place restrictions on the complexity of the approximations that are understandable to humans. For example, a decision tree with a hundred levels cannot be considered interpretable. Therefore, it is important to not only have an intuitive representation but also to have smaller complexity (e.g., fewer rules in case of rule based models, fewer features with non-zero coefficients in case of linear models).

Interactivity: Users might want to understand the decision logic in subspaces characterized by certain feature values (e.g., How does the model behave for patients over the age of 50 vs. patients under the age of 30?). In this case, a generic explanation of the behavior of the black box model may not be ideal; the features the user is interested in may not even appear in this generic explanation. This scenario highlights the need for customized approximations which allow users to explore the behavior of black box models based on their preferences.

2.1. Our Representation: Two Level Decision Sets

We choose two level decision sets as the representation of our approximations. The basic building block of this structure is a decision set which is a set of if-then rules that are unordered. The two level decision set can be regarded as a set of multiple decision sets, each of which is embedded within an outer if-then structure, such that the inner if-then rules represent the decision logic employed by the black box model while labeling instances within the subspace characterized by the conditions in the outer if-then clauses. Consequently, we refer to the conditions in the outer if-then rules as neighborhood descriptors and the inner if-then rules as decision logic rules.

While the expressive power of two level decision sets is the same as that of other rule based models (e.g., decision sets/lists/trees), the nesting of if-then clauses in a two level decision set representation enables the optimization algorithm (discussed later) to select neighborhood descriptors and decision logic rules such that higher fidelity can be obtained with minimal complexity, thus resulting in more compact approximations compared to conventional decision sets (more details in the experiments section). In addition, the two level decision set representation does not have the pitfalls associated with decision lists, where understanding a particular rule requires reasoning about all the previously encountered rules because of the if-else-if construct [4].

Definition 1.

A two level decision set $R$ is a set of rules $\{(q_1, s_1, c_1), (q_2, s_2, c_2), \ldots, (q_M, s_M, c_M)\}$ where $q_i$ and $s_i$ are conjunctions of predicates of the form (feature, operator, value) (e.g., age $\geq$ 50) and $c_i$ is a class label. $q_i$ corresponds to the subspace descriptor, and $(s_i, c_i)$ together represent the inner if-then rules (decision logic rules), with $s_i$ denoting the condition and $c_i$ denoting the class label. A two level decision set $R$ assigns a label to an instance $x$ as follows: if $x$ satisfies exactly one of the rules $i$, i.e., $x$ satisfies $q_i \wedge s_i$, then its label is the corresponding class label $c_i$. If $x$ satisfies none of the rules in $R$, then its label is assigned using a default function, and if $x$ satisfies more than one rule in $R$, then its label is assigned using a tie-breaking function. (Note that the optimization problem that we formulate in Section 2.2.2 will ensure that the need to invoke default or tie-breaking functions is minimized.)

In our experiments, we employ a default function which computes the majority class label (assigned by the black box model) of all the instances in the training data which do not satisfy any rule in $R$ and assigns this majority label to them. For each instance which satisfies more than one rule in $R$, we break ties by choosing the rule which has a higher agreement rate with the black box model. Other forms of default and tie-breaking functions can be easily incorporated into our framework.
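The labeling procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the rule encoding as `(q, s, c)` tuples of `(feature, operator, value)` predicates, the helper names, and the toy patients are all assumptions made for the example.

```python
from collections import Counter
import operator

# Hypothetical encoding: a conjunction is a tuple of (feature, operator, value).
OPS = {"==": operator.eq, ">=": operator.ge, "<": operator.lt}

def satisfies(x, conjunction):
    """True if instance x (a dict) satisfies every predicate."""
    return all(OPS[op](x[feature], value) for feature, op, value in conjunction)

def fires(x, rule):
    """A rule (q, s, c) fires when both the neighborhood descriptor q
    and the decision logic condition s hold."""
    q, s, _ = rule
    return satisfies(x, q) and satisfies(x, s)

def agreement(rule, data, bb_labels):
    """Fraction of instances covered by the rule whose black box label equals c."""
    hits = [y for x, y in zip(data, bb_labels) if fires(x, rule)]
    return sum(y == rule[2] for y in hits) / len(hits) if hits else 0.0

def default_label(rules, data, bb_labels):
    """Majority black box label among instances that satisfy no rule."""
    uncovered = [y for x, y in zip(data, bb_labels)
                 if not any(fires(x, r) for r in rules)]
    return Counter(uncovered or bb_labels).most_common(1)[0][0]

def predict(x, rules, data, bb_labels):
    hits = [r for r in rules if fires(x, r)]
    if not hits:                      # no rule applies: default function
        return default_label(rules, data, bb_labels)
    if len(hits) == 1:                # exactly one rule applies
        return hits[0][2]
    # several rules apply: tie-break by agreement with the black box
    return max(hits, key=lambda r: agreement(r, data, bb_labels))[2]

# Toy data (invented for illustration only).
OLD, YOUNG = (("age", ">=", 50),), (("age", "<", 50),)
rules = [(OLD, (("insomnia", "==", True),), "Depression"),
         (YOUNG, (("insomnia", "==", False),), "Healthy")]
data = [{"age": 60, "insomnia": True}, {"age": 30, "insomnia": False},
        {"age": 30, "insomnia": True}, {"age": 70, "insomnia": False}]
bb_labels = ["Depression", "Healthy", "Depression", "Depression"]
```

Calling `predict(data[1], rules, data, bb_labels)` uses the single matching rule, while an instance matching no rule falls back to the majority label of the uncovered training instances.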

2.2. Black Box Explanations through Transparent Approximations

Next, we show how to quantify the desiderata presented earlier in the context of two-level decision sets, then formulate it as an objective function and propose an optimization procedure.

2.2.1. Quantifying Fidelity, Unambiguity, and Interpretability

Table 1 shows how we can quantify the properties discussed earlier w.r.t. a two level decision set approximation $R$, a black box model $B$, and a dataset $D = \{x_1, x_2, \ldots, x_N\}$ where $x_i$ captures the feature values of instance $i$. We treat the black box model $B$ as a function which takes an instance $x$ as input and returns a class label.

Quantifying Fidelity: disagreement($R$) quantifies the infidelity of approximation $R$ to the black box model $B$ by summing up, for each rule $(q, s, c)$ in $R$, the number of instances which satisfy $q \wedge s$ but for which the label assigned by the black box model $B$ does not match the label $c$.

Quantifying Unambiguity: For every pair of rules $(q_i, s_i, c_i)$ and $(q_j, s_j, c_j)$ in $R$ where $i \neq j$, we compute the number of instances which satisfy both $q_i \wedge s_i$ and $q_j \wedge s_j$, and sum up all these counts. This sum is denoted by ruleoverlap($R$). Furthermore, it is important that the approximation that we generate explains or covers as much of the feature space (ideally, all of it) as possible. This notion is captured by cover($R$), which is the number of those instances which satisfy the condition $q_i \wedge s_i$ associated with some rule in $R$.

Quantifying Interpretability: size($R$) is the number of rules (triples of the form $(q, s, c)$) in the two level decision set $R$. maxwidth($R$) is the maximum width computed over all the elements in $R$, where each element is either a condition $s$ of some decision logic rule or a neighborhood descriptor $q$, and the width of an element is the number of predicates it contains. numpreds($R$) counts the number of predicates in $R$, including those appearing in both the decision logic rules and neighborhood descriptors. Note that the predicates of neighborhood descriptors are counted multiple times, as a neighborhood descriptor could potentially appear alongside multiple decision logic rules. numdsets($R$) is the number of unique neighborhood descriptors (outer if clauses) in $R$.

In a two-level decision set, each neighborhood descriptor characterizes a specific region of the feature space and the corresponding inner if-then rules specify the decision logic of the black box model within that region. To make this distinction clear, we minimize the number of overlapping features. For every pair of a unique neighborhood descriptor $q$ and a decision logic rule $s$, we compute the number of features that occur in both $q$ and $s$ (featureoverlap($q$, $s$)) and then sum up these counts. The resulting sum is denoted as featureoverlap($R$).

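Fidelity
    disagreement($R$) $= \sum_{i=1}^{M} |\{x \mid x \in D,\ x \text{ satisfies } q_i \wedge s_i,\ B(x) \neq c_i\}|$
Unambiguity
    ruleoverlap($R$) $= \sum_{i=1}^{M} \sum_{j=1, j \neq i}^{M} \text{overlap}(q_i \wedge s_i,\ q_j \wedge s_j)$, where overlap counts the instances in $D$ satisfying both conditions
    cover($R$) $= |\{x \mid x \in D,\ x \text{ satisfies } q_i \wedge s_i \text{ for some } i \in \{1, \ldots, M\}\}|$
Interpretability
    size($R$): number of rules (triples of the form $(q, s, c)$) in $R$
    maxwidth($R$) $= \max_{e \in \bigcup_{i=1}^{M} \{q_i, s_i\}} \text{width}(e)$
    numpreds($R$) $= \sum_{i=1}^{M} \left(\text{width}(s_i) + \text{width}(q_i)\right)$
    numdsets($R$) $= |\text{dset}(R)|$, where $\text{dset}(R) = \bigcup_{i=1}^{M} \{q_i\}$
    featureoverlap($R$) $= \sum_{q \in \text{dset}(R)} \sum_{i=1}^{M} \text{featureoverlap}(q, s_i)$
Table 1. Measures for Fidelity, Interpretability and Unambiguity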
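To make the measures of this section concrete, the following Python sketch computes each of them on a tiny invented example. The `(q, s, c)` tuple encoding, the function names, and the data are assumptions for illustration. Rule overlap is summed over ordered pairs with $i \neq j$, which is one reading of the prose, so an instance shared by two rules contributes twice.

```python
from itertools import permutations
import operator

OPS = {"==": operator.eq, ">=": operator.ge, "<": operator.lt}

def satisfies(x, conj):
    return all(OPS[op](x[f], v) for f, op, v in conj)

def fires(x, rule):
    q, s, _ = rule
    return satisfies(x, q) and satisfies(x, s)

def features(conj):
    return {f for f, _, _ in conj}

def disagreement(rules, data, bb):
    # covered instances whose black box label differs from the rule label
    return sum(1 for r in rules for x, y in zip(data, bb) if fires(x, r) and y != r[2])

def ruleoverlap(rules, data):
    # ordered pairs (i != j): instances satisfying both rules
    return sum(1 for a, b in permutations(rules, 2) for x in data
               if fires(x, a) and fires(x, b))

def cover(rules, data):
    return sum(1 for x in data if any(fires(x, r) for r in rules))

def size(rules):
    return len(rules)

def maxwidth(rules):
    return max(len(e) for q, s, _ in rules for e in (q, s))

def numpreds(rules):
    # descriptor predicates are counted once per rule they appear in
    return sum(len(q) + len(s) for q, s, _ in rules)

def numdsets(rules):
    return len({q for q, _, _ in rules})

def featureoverlap(rules):
    unique_q = {q for q, _, _ in rules}
    return sum(len(features(q) & features(s)) for q in unique_q for _, s, _ in rules)

# Toy example (invented for illustration only).
OLD, YOUNG = (("age", ">=", 50),), (("age", "<", 50),)
rules = [
    (OLD, (("insomnia", "==", True),), "Depression"),
    (YOUNG, (("insomnia", "==", False),), "Healthy"),
    (OLD, (("age", ">=", 58), ("insomnia", "==", True)), "Depression"),
]
data = [{"age": 60, "insomnia": True}, {"age": 55, "insomnia": True},
        {"age": 30, "insomnia": False}, {"age": 30, "insomnia": True}]
bb = ["Depression", "Healthy", "Healthy", "Depression"]
```

On this example the first and third rules both fire for the 60-year-old patient, which is what drives the non-zero rule overlap, and the third rule reuses the `age` feature of both descriptors, which drives the feature overlap.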

2.2.2. Optimization Problem

We assume we are given as inputs a dataset $D$, labels assigned to instances in $D$ by black box model $B$, a set of possible class labels $C$, a candidate set of conjunctions of predicates $ND$ (e.g., Age $\geq$ 50 and Gender = Female) from which we can pick the neighborhood descriptors, and another candidate set of conjunctions of predicates $DL$ from which we can choose the decision logic rules. In practice, a frequent itemset mining algorithm such as Apriori [1] can be used to generate the candidate sets of conjunctions of predicates. Without any input from the user, both $ND$ and $DL$ are assigned to the same candidate set generated by Apriori. On the other hand, if the user is interested in exploring the behavior of the black box model w.r.t. some features $U$ (e.g., exercise and smoking), $ND$ is initialized to those conjunctions from the candidate set which comprise only the features in $U$.
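The candidate generation step can be illustrated with a small level-wise, Apriori-style miner written in plain Python. It is a simplified sketch under stated assumptions: items are hypothetical `(feature, value)` pairs, the support threshold and width cap are invented parameters, and the full subset-pruning step of Apriori is omitted for brevity.

```python
from itertools import combinations

def frequent_conjunctions(transactions, min_support, max_width):
    """Return itemsets (frozensets of (feature, value) pairs) whose
    support is at least min_support, growing one item per level."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]
    frequent = []
    for width in range(1, max_width + 1):
        kept = [c for c in current
                if sum(c <= t for t in transactions) / n >= min_support]
        frequent.extend(kept)
        # join step: merge frequent sets that differ in exactly one item
        current = list({a | b for a, b in combinations(kept, 2)
                        if len(a | b) == width + 1})
    return frequent

def restrict_to_features(candidates, user_features):
    """Keep only conjunctions built entirely from features the user chose."""
    return [c for c in candidates if {f for f, _ in c} <= set(user_features)]

# Toy records (invented for illustration only).
transactions = [
    frozenset({("smoking", "yes"), ("exercise", "no"), ("insomnia", "yes")}),
    frozenset({("smoking", "yes"), ("exercise", "no"), ("insomnia", "no")}),
    frozenset({("smoking", "no"), ("exercise", "yes"), ("insomnia", "no")}),
    frozenset({("smoking", "yes"), ("exercise", "yes"), ("insomnia", "yes")}),
]
candidates = frequent_conjunctions(transactions, min_support=0.5, max_width=2)
DL = candidates                                   # decision logic candidates
ND = restrict_to_features(candidates, {"smoking", "exercise"})  # user-driven descriptors
```

With no user input, `ND` would simply be set to `candidates` as well; the restriction shown here corresponds to a user asking about exercise and smoking.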

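In order to facilitate theoretical analysis, the metrics from Section 2.2.1 are expressed in the objective function either as non-negative reward functions or as constraints. To construct non-negative reward functions, penalty terms (metrics defined previously) are subtracted from their corresponding upper bound values ($P_{max}$, $O_{max}$, $O'_{max}$, $F_{max}$), which are computed with respect to $ND$ and $DL$:

$f_1(R) = P_{max} - \text{numpreds}(R)$
$f_2(R) = O_{max} - \text{featureoverlap}(R)$
$f_3(R) = O'_{max} - \text{ruleoverlap}(R)$
$f_4(R) = \text{cover}(R)$
$f_5(R) = F_{max} - \text{disagreement}(R)$

Here $W_{max}$, the maximum width of any rule in either candidate set, enters the computation of these upper bounds. The resulting optimization problem is:

$\arg\max_{R \subseteq ND \times DL \times C} \sum_{i=1}^{5} \lambda_i f_i(R)$    (1)
subject to $\text{size}(R) \leq \epsilon_1$, $\text{maxwidth}(R) \leq \epsilon_2$, $\text{numdsets}(R) \leq \epsilon_3$.

$\lambda_1, \ldots, \lambda_5$ are non-negative weights which manage the relative influence of the terms in the objective. These can be specified by an end user or can be set using cross validation. The values of $\epsilon_1$, $\epsilon_2$, $\epsilon_3$ are application dependent and need to be set by an end user.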
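As a rough sketch of how the reward terms and constraints fit together, the snippet below evaluates a weighted objective from precomputed metric values. The dictionary keys, the numeric upper bounds, and the thresholds are hypothetical placeholders chosen for illustration; how the upper bounds are derived from the candidate sets is not reproduced here.

```python
def objective(metrics, bounds, weights):
    """Weighted sum of non-negative rewards: each penalty metric is
    subtracted from its upper bound, and cover is used directly."""
    rewards = [
        bounds["numpreds"] - metrics["numpreds"],
        bounds["featureoverlap"] - metrics["featureoverlap"],
        bounds["ruleoverlap"] - metrics["ruleoverlap"],
        metrics["cover"],
        bounds["disagreement"] - metrics["disagreement"],
    ]
    if any(r < 0 for r in rewards):
        raise ValueError("an upper bound is smaller than its metric")
    return sum(w * r for w, r in zip(weights, rewards))

def feasible(metrics, eps_size, eps_width, eps_dsets):
    """Constraints on size, maxwidth and numdsets."""
    return (metrics["size"] <= eps_size
            and metrics["maxwidth"] <= eps_width
            and metrics["numdsets"] <= eps_dsets)

# Hypothetical metric values and bounds, for illustration only.
metrics = {"numpreds": 7, "featureoverlap": 2, "ruleoverlap": 2, "cover": 3,
           "disagreement": 1, "size": 3, "maxwidth": 2, "numdsets": 2}
bounds = {"numpreds": 20, "featureoverlap": 10, "ruleoverlap": 16, "disagreement": 8}
score = objective(metrics, bounds, weights=[1, 1, 1, 1, 1])
```

Raising the weight on the disagreement term favors fidelity, while raising the weight on the numpreds term favors shorter rules, which is the trade-off the weights are meant to control.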

Theorem 2.1.

The objective function in Eqn. 1 is non-normal, non-negative, non-monotone, submodular and the constraints of the optimization problem are matroids.

Proof (Sketch).

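The objective function is non-negative: in each of the reward functions built from numpreds, featureoverlap, ruleoverlap, and disagreement, the first term is an upper bound on the value that can be taken by the second term, ensuring non-negativity. In the case of the cover term, the metric cover cannot be negative as it denotes the number of instances in the data that satisfy some rule in the approximation. The objective is non-normal because a reward of the form (upper bound minus penalty) does not vanish on the empty set: for the empty approximation the penalty is zero, so the reward equals its upper bound. Since one of the terms is non-normal and the objective is a non-negative linear combination, the objective function is non-normal. In order to prove that the objective is non-monotone, consider one of the penalty-based reward functions, for example the one built from numpreds, and two approximations $A$ and $A'$ such that $A \subseteq A'$, i.e., $A'$ has at least as many rules as $A$. By definition of the numpreds metric, numpreds($A$) $\leq$ numpreds($A'$), which implies that the reward of $A$ is at least the reward of $A'$. Since this term is non-monotone, so is the entire objective function. Last, the reward functions built from numpreds and disagreement are modular (each is a sum of per-rule terms) and the other three functions in the objective turn out to be submodular. The constraints of the optimization problem are matroids because they satisfy the following two properties: 1) the empty set satisfies each of the constraints; 2) if an approximation $A'$ satisfies the constraints, then every $A \subseteq A'$ also satisfies the constraints. ∎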

Corollary 2.2.

The optimization problem in Eqn. 1 is NP-Hard.

Proof (Sketch).

The objective function in Eqn. 1 is submodular and maximizing a submodular function is NP-Hard [2]. ∎

While exactly solving the optimization problem in Eqn. 1 is NP-Hard, the specific properties of the problem (non-monotonicity, submodularity, non-normality, non-negativity, and the accompanying matroid constraints) allow for applying algorithms with provable optimality guarantees. We employ an optimization procedure based on approximate local search (see Algorithm 1) which provides the best known theoretical guarantees (1/5 approximation) for this class of problems.

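Input: Objective $f$, domain $ND \times DL \times C$, parameter $\delta$, number of constraints $k$
$V_1 = ND \times DL \times C$
for $i \in \{1, 2, \ldots, k+1\}$ do    (approximate local search procedure)
    $X = V_i$; $n = |X|$; $S_i = \emptyset$
    Let $v$ be the element with the maximum value for $f$ and set $S_i = \{v\}$
    while there exists a delete/exchange operation which increases the value of $S_i$ by a factor of at least $(1 + \delta / n^4)$ do
        Delete operation: if $e \in S_i$ such that $f(S_i \setminus \{e\}) \geq (1 + \delta / n^4) f(S_i)$, then $S_i = S_i \setminus \{e\}$
        Exchange operation: if $d \in X \setminus S_i$ and $e_j \in S_i \cup \{\emptyset\}$ (for $1 \leq j \leq k$) such that
            $(S_i \setminus \{e_j\}) \cup \{d\}$ (for $1 \leq j \leq k$) satisfies all the $k$ constraints and
            $f((S_i \setminus \{e_1, \ldots, e_k\}) \cup \{d\}) \geq (1 + \delta / n^4) f(S_i)$, then $S_i = (S_i \setminus \{e_1, \ldots, e_k\}) \cup \{d\}$
    end while
    $V_{i+1} = V_i \setminus S_i$
end for
return the solution corresponding to $\max\{f(S_1), f(S_2), \ldots, f(S_{k+1})\}$
Algorithm 1. Optimization Procedure [5]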
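The flavor of this local search can be conveyed with a simplified Python sketch. It is not the procedure of [5] and carries no approximation guarantee: it assumes a non-negative set function, uses a single-element swap instead of removing up to $k$ elements per exchange, requires a strict improvement so that the loop terminates, and takes feasibility as a user-supplied callback. All names and the toy objective are invented for illustration.

```python
def _improving_move(f, S, ground, feasible, factor):
    """Return a better neighbor of S reachable by one delete or one swap, else None."""
    target = factor * f(S)
    for e in sorted(S):                         # delete operation
        if f(S - {e}) > target:
            return S - {e}
    for d in sorted(ground - S):                # exchange operation
        for e in [None] + sorted(S):            # None means "add d, remove nothing"
            T = (S - {e}) | {d}
            if feasible(T) and f(T) > target:
                return T
    return None

def local_search(f, ground, feasible, delta):
    singles = [e for e in sorted(ground) if feasible({e})]
    if not singles:
        return set()
    S = {max(singles, key=lambda e: f({e}))}    # start from the best single element
    factor = 1 + delta / len(ground) ** 4
    while True:
        T = _improving_move(f, S, ground, feasible, factor)
        if T is None:
            return S
        S = T

def approx_local_search(f, ground, feasible, k, delta=0.1):
    """Run local search k+1 times, removing each solution from the ground set."""
    remaining, solutions = set(ground), []
    for _ in range(k + 1):
        if not remaining:
            break
        S = local_search(f, remaining, feasible, delta)
        solutions.append(S)
        remaining -= S
    return max(solutions, key=f)

# Toy objective: instances covered minus a per-element cost (illustrative only).
COVERS = {1: {"a", "b", "c"}, 2: {"c", "d"}, 3: {"a"}, 4: {"e", "f"}}
def f(S):
    covered = set().union(*(COVERS[e] for e in S)) if S else set()
    return len(covered) - 0.5 * len(S)

result = approx_local_search(f, COVERS.keys(), lambda S: len(S) <= 2, k=1)
```

In the setting of this paper, the ground set would be the candidate triples from $ND \times DL \times C$, `f` the weighted objective, and `feasible` the check on size, maxwidth and numdsets.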

3. Experimental Evaluation

We evaluate our framework on a depression diagnosis dataset [4] collected by an online health records portal, comprising medical history, symptoms, and demographic information of about 33K individuals. The class label of each individual is either depressed or healthy.


We benchmark the performance of our framework against the following baselines: 1) Locally interpretable model agnostic explanations (LIME) [8]; 2) Interpretable Decision Sets (IDS) [4]; 3) Bayesian Decision Lists (BDL) [6]. We employ IDS and BDL to approximate other black box models by training them with the labels of the black box models as the ground truth labels. We also construct the following variants: 4) LIME-DS, where each local linear model in the LIME approach is replaced with a decision set; 5) BETA-LM, where we group instances in the data based on neighborhood descriptors obtained using our approach and then fit a separate linear model for each of these neighborhoods.

Analyzing the Tradeoffs between Fidelity and Interpretability

Fidelity and interpretability are competing objectives, where fidelity favors details and nuances while interpretability favors simplicity. To understand how effectively different approaches trade off fidelity against interpretability, we plot agreement rate vs. various metrics of interpretability (outlined in Section 2) for approximations generated by our framework and other baselines. As a measure of fidelity, we compute the agreement rate: the fraction of instances in the data for which the label assigned by the approximation is the same as the black box model prediction. Figures 5(a) and 5(b) show the plots of agreement rate vs. number of rules (size) and agreement rate vs. average number of predicates (ratio of numpreds to size) for the explanations constructed to approximate a 5 layer deep neural network using our model, LIME-DS, IDS, and BDL. Our approximations consistently demonstrate higher agreement rates at lower values of the desired metrics. For instance, at an average width of 10 predicates per rule, our approximation already reaches an agreement rate of about 85%, whereas other approaches require at least 20 predicates per rule to attain this agreement rate (Figure 5(b)). We also plot agreement rate vs. number of neighborhoods for the approximations generated by our approach and its linear variant, LIME, and LIME-DS (see Figure 5(c)). Our approximations achieve high fidelity (about 85% agreement rate) with as few as 5 neighborhoods, whereas LIME requires choosing about 20 neighborhoods to achieve the same agreement rate.

Figure 5. Fidelity vs. Interpretability Trade Offs for Depression Diagnosis Data: (a) Number of Rules, (b) Avg. Number of Predicates, (c) Number of Neighborhoods.
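The agreement rate used as the fidelity measure in this comparison is simple to compute. The sketch below assumes two equal-length label sequences; the function name and the toy labels are invented for illustration.

```python
def agreement_rate(approx_labels, bb_labels):
    """Fraction of instances where the approximation's label equals
    the black box model's prediction."""
    if len(approx_labels) != len(bb_labels):
        raise ValueError("label sequences must have the same length")
    if not bb_labels:
        raise ValueError("at least one instance is required")
    matches = sum(a == b for a, b in zip(approx_labels, bb_labels))
    return matches / len(bb_labels)

# Toy labels (illustrative only).
approx = ["Depression", "Healthy", "Healthy", "Depression"]
black_box = ["Depression", "Healthy", "Depression", "Depression"]
rate = agreement_rate(approx, black_box)
```

Note that the reference labels are the black box predictions rather than the ground truth, so this measures how well the approximation mimics the model, not how accurate either one is.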

We also found that the approximations generated using IDS and our approach result in low values of ruleoverlap (between and %) and high values for cover ( to %). The decision list representation by design achieves the optimal values of zero for ruleoverlap and for cover.

User Studies

We designed an online user study with 33 participants, where each participant was randomly presented with the approximations (for a 5 layer deep neural network model) generated by: 1) our approach, 2) IDS, 3) BDL. Participants were asked questions, each of which was designed to test the user's understanding of the model behavior in different parts of the feature space. An example question is: Consider a patient who is female and aged 65 years. Based on the approximation shown above, can you be absolutely sure that this patient is Healthy? If not, what other conditions need to hold for this patient to be labeled as Healthy? These questions closely mimic decision making in real-world settings where decision makers would like to reason about model behavior in certain parts of the feature space. We computed the accuracy of the answers provided by users. We also recorded the time taken to answer each question and used this to compute the average time spent (in seconds) on each question. Table 2 (top) shows the results obtained using approximations from our model, IDS, and BDL. It can be seen that user accuracy associated with our approach was higher than that of IDS and BDL. Users were about 1.5 and 2.3 times faster when using our approximation compared to those constructed by IDS and BDL, respectively.

Approach               Human Accuracy   Avg. Time (in secs.)
Task 1 (top)
Our Approach - BETA    94.5%            160.1
IDS                    89.2%            231.1
BDL                    83.7%            368.5
Interactive setting (bottom)
Our Approach - BETA    98.3%            78.3
Table 2. Results of User Study.

We also measured the benefit obtained using interactivity, where the approximation presented to the user is customized w.r.t. the question the user is trying to answer. For example, imagine the question above now asking about a patient who smokes and does not exercise. Whenever a user was asked this question, we showed them an approximation where exercise and smoking appear in the neighborhood descriptors, thus simulating the effect of the user trying to interactively explore the model w.r.t. these features. We recruited 11 participants for this study and asked each of them the same questions as those asked in task 1. It can be seen (Table 2, bottom) that the time taken to answer questions is reduced by almost half compared to the setting where we showed users the same approximation each time. Answers provided are also comparatively more accurate.
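The timing comparisons reported for the user studies follow directly from the average times in Table 2, as the short calculation below shows. The variable names are illustrative; the numbers are the ones listed in the table.

```python
# Average time per question in seconds, as reported in Table 2.
beta_time, ids_time, bdl_time = 160.1, 231.1, 368.5
beta_interactive_time = 78.3

# How many times longer users took with each baseline than with BETA.
ids_ratio = round(ids_time / beta_time, 2)
bdl_ratio = round(bdl_time / beta_time, 2)

# Time in the interactive setting relative to the non-interactive BETA setting.
interactive_ratio = round(beta_interactive_time / beta_time, 2)

print(ids_ratio, bdl_ratio, interactive_ratio)
```

The first two ratios correspond to the "about 1.5 and 2.3 times faster" statement, and the last one to the observation that the interactive setting roughly halves the time per question.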


  • [1] R. Agrawal, R. Srikant, et al. Fast algorithms for mining association rules.
  • [2] S. Khuller, A. Moss, and J. S. Naor. The budgeted maximum coverage problem. Information Processing Letters, 70(1):39–45, 1999.
  • [3] P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730, 2017.
  • [4] H. Lakkaraju, S. H. Bach, and J. Leskovec. Interpretable decision sets: A joint framework for description and prediction. In KDD, 2016.
  • [5] J. Lee, V. S. Mirrokni, V. Nagarajan, and M. Sviridenko. Non-monotone submodular maximization under matroid and knapsack constraints. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 323–332. ACM, 2009.
  • [6] B. Letham, C. Rudin, T. H. McCormick, D. Madigan, et al. Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics, 9(3):1350–1371, 2015.
  • [7] Y. Lou, R. Caruana, and J. Gehrke. Intelligible models for classification and regression. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 150–158. ACM, 2012.
  • [8] M. T. Ribeiro, S. Singh, and C. Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
  • [9] L. Rokach and O. Maimon. Top-down induction of decision trees classifiers-a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 35(4):476–487, 2005.