With the ever-increasing adoption of black-box artificial intelligence technologies in various facets of society, many interpretable algorithms have been used to explain the decisions taken by black-box models. They are chiefly of two types: a) local explanations for a data point of interest [24, 12, 13, 14, 19, 6], and b) interpretable global models constructed directly, such as decision trees, rule lists and boolean rules. One argument for b) is that it is sometimes possible to construct interpretable global models such that, for a single data point, they give a succinct local explanation in the form of a sparse conjunction. Methods in a) naturally provide parsimonious explanations for a single data point, through either feature importance scores or contrastive points that differ in very few features. Another option is to derive a global interpretable model from a complex one by means of information transfer [8, 7]. However, these new models may not be locally faithful to the explanations of the original complex black-box model.
As such, there has been a limited amount of effort in directly leveraging sparse local explanations, which are logically conjunctions of conditions on a few features, to build transparent models without sacrificing too much performance. An approach that builds a globally transparent model from local explanations could retain the structure (sparse interactions among features) of the local explanations of a complex black-box model that is preferred in a specific application (e.g., boosted trees or neural networks, which may be preferable to deploy because of their accuracy).
We propose a new algorithm that uses local explanations from a contrastive explanations method to generate boolean clauses, which are conjunctions. These boolean conjunctions can be used as features, forming a new dataset on which another simple model is trained, such as a sparse linear model (e.g., L1-penalized logistic regression) or a small decision tree. The algorithm binarizes local contrastive explanations depending on the difference in feature values between the contrast point and the original. The binarization of features is thus directed by the ranges that the complex model deems locally important. One of the most interesting aspects of this idea is that the sparse interactions between original features that are required for explaining are directly captured by these boolean clauses.
To showcase our idea, we use the model agnostic contrastive explanations method, which generates local explanations in the form of pertinent positives (PPs) and pertinent negatives (PNs). PPs are a minimal set of features, with minimal values, that are sufficient to obtain the classification of the original input. For example, given an image of a 3 in MNIST, the PP will be some subset of the non-zero intensity pixels in the 3, with grey-scale values not exceeding those in the original image, that the classifier still assigns the same class as the original 3. A PN, on the other hand, is a minimal set of features that, if increased, will change the classification of the original input. In our example of a 3, a small set of pixels at the top right that were zero before, but now have positive intensity and are perceived as a horizontal line that makes the image look like a 5 to the classifier, would constitute a PN.
The key insight is that, under our mapping, each explanation can be viewed as a conjunction (AND) of positive literals (discretized PPs) and negated literals (discretized PNs), forming a boolean clause. This is why we chose the contrastive explanations method as our local explainability technique: it yields a sparse set of both positive and negative literals, whereas many explainability methods do not generate PNs. Moreover, this formula is likely to be small since, as mentioned above, the method returns sparse explanations. In principle, one could take a disjunction (OR) of all these local formulas for a particular class and obtain a two-level boolean formula that acts as a global classifier. Of course, this formula would be too large, and its generalization would still have to be evaluated on a test set. Nonetheless, given these extracted formulas, which are in essence new sparse boolean conjunctions, one may be able to train an existing learner from a simple model class. We use logistic regression (with L1 penalty) or (small) decision trees as our simple base learners; however, there are other possibilities (viz. boosting). These models end up consuming only a small fraction of these conjunctions. An illustration of the whole process is given in Figure 1, along with an example formula for an input in Figure 2.
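As a minimal sketch of the two-level formula mentioned above, assume each local conjunction is stored as a dictionary mapping a feature index to a half-open interval (a hypothetical representation chosen for illustration, not one prescribed by our method). The disjunction of clauses for each class can then be evaluated as follows; breaking ties by dictionary order is also an illustrative assumption.

```python
from typing import Dict, List, Tuple

import numpy as np

# Hypothetical representation: a clause maps a feature index j to an interval
# (low, high), read as the literal low <= x[j] < high; the clause is their AND.
Clause = Dict[int, Tuple[float, float]]


def clause_holds(clause: Clause, x: np.ndarray) -> bool:
    """Conjunction (AND) of all interval literals in one clause."""
    return all(low <= x[j] < high for j, (low, high) in clause.items())


def dnf_predict(clauses_by_class: Dict[int, List[Clause]], x: np.ndarray, default: int) -> int:
    """Two-level formula: OR over the clauses of each class; the first firing class wins."""
    for label, clauses in clauses_by_class.items():
        if any(clause_holds(clause, x) for clause in clauses):
            return label
    return default


# Usage: two clauses for class 1; class 0 is the fallback.
rules = {1: [{0: (0.5, np.inf)}, {1: (-np.inf, 0.0)}]}
print(dnf_predict(rules, np.array([0.7, 1.0]), default=0))  # 1 (first clause fires)
print(dnf_predict(rules, np.array([0.1, 1.0]), default=0))  # 0 (no clause fires)
```

With one clause per training point, such a formula quickly becomes unwieldy, which is why the conjunctions are instead handed to a simple base learner as features.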
2 Related Work
Most of the work on explainability in artificial intelligence can be said to fall under four major categories: Local posthoc methods, global posthoc methods, directly interpretable methods and visualization based methods.
Local Posthoc Methods: Methods under this category look to generate explanations at a per instance level for a given complex classifier that is uninterpretable. Methods in this category are either proxy model based [14, 17] or look into the internals of the model [1, 5, 24, 13]. Some of these methods also work with only black-box access [14, 6]. There are also a number of methods in this category specifically designed for images [20, 1, 18].
Global Posthoc Methods: These methods try to build an interpretable model on the whole dataset using information from the black-box model, with the intention of approaching the black-box model's performance. Methods in this category either use predictions (soft or hard) of the black-box model to train simpler interpretable models [8, 2, 3] or extract weights based on the prediction confidences to reweight the dataset.
Directly Interpretable Methods: Methods in this category include some of the traditional models, such as decision trees or logistic regression. There has been a lot of recent effort to efficiently and accurately learn rule lists [15, 16], two-level boolean rules or decision sets. There has also been work inspired by other fields, such as psychometrics and healthcare.
Visualization based Methods:10]. The idea is that by exposing such representations one may be able to gauge if the neural network is in fact capturing semantically meaningful high level features.
The most relevant categories to our current endeavor are the local and global posthoc methods. Although global posthoc methods try to capture the global behavior of black-box models, the coupling is weak, since it comes mainly from matching output behavior; these methods neither leverage nor are necessarily consistent with the local explanations one might obtain.
In this section, we first describe the strategy for obtaining local contrastive explanations for arbitrary black-box models. We then show how our method, Global Boolean Feature Learning (GBFL), maps these explanations to boolean formulas that can subsequently be consumed by simple models as features to learn on; this mapping is our main contribution.
3.1 Obtaining Contrastive Explanations
To learn our boolean features, we first need a local explainability technique that can extract contrastive explanations from arbitrary black-box models. The method we use is the model agnostic contrastive explanations method, which can generate PPs and PNs with only black-box access.
Formal Definitions: We formally define PPs and PNs, and describe what we obtain from the contrastive explanations method. Consider a training dataset consisting of $m$ samples. $\mathbf{x}^i$ denotes the $i$-th training sample, with $\mathbf{x}^i \in \mathbb{R}^d$ and $y^i \in \mathcal{Y}$, where $\mathcal{Y}$ is a finite set of class labels. Let us denote the training dataset by $\mathcal{D} = \{(\mathbf{x}^i, y^i)\}_{i=1}^{m}$.
Base Value Vector: To find PPs/PNs, the method requires specifying, for each feature, a value that is least interesting, which its authors term the base value. A user can prespecify semantically meaningful base values, or a default such as the median can be used. Classifiers essentially pick up the correlation between variation away from this value in a given coordinate and the target class. Therefore, we define a vector $\mathbf{b} \in \mathbb{R}^d$ of base values, where $b_j$ represents the base value of the $j$-th feature; variation away from the base value is what correlates with the target class.
Upper and Lower Bounds: Let $l_j$ and $u_j$ be lower and upper bounds for the $j$-th feature, i.e., $l_j \le x_j \le u_j$.
Consider a pre-trained classifier $f$, and let $f_c(\mathbf{x})$ denote the classifier's confidence score (a probability) for class $c \in \mathcal{Y}$ given $\mathbf{x}$.
Pertinent Positive: Let $\mathbf{x}^i_{\mathrm{pp}}$ denote the pertinent positive vector associated with a training sample $\mathbf{x}^i$.
In other words, the pertinent positive is a sparse vector that the classifier assigns, with high confidence, to the same class as the original input. Being sparse, it is expected to vary less from the base values than $\mathbf{x}^i$.
Pertinent Negative: Let $\mathbf{x}^i_{\mathrm{pn}}$ denote the pertinent negative vector associated with a training sample $\mathbf{x}^i$.
In other words, the pertinent negative is a vector that differs from $\mathbf{x}^i$ in only a few coordinates. It forces the classifier to assign it, with high confidence, to some other class, and in the coordinates where it differs from $\mathbf{x}^i$, it lies farther from the base values than $\mathbf{x}^i$ does.
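The optimization problems that produce PPs and PNs are not reproduced here, but the verbal conditions above can be checked directly. The sketch below is a hypothetical checker, not part of the contrastive explanations method: it assumes a classifier exposed as a `predict` callable returning a class label and an explicit sparsity level `k`, and it ignores the high-confidence requirement for brevity.

```python
from typing import Callable

import numpy as np


def is_pertinent_positive(x: np.ndarray, pp: np.ndarray, base: np.ndarray,
                          predict: Callable[[np.ndarray], int], k: int) -> bool:
    """Check the verbal PP conditions for a candidate pp of input x."""
    # Sparsity: at most k coordinates move away from their base values.
    if np.count_nonzero(pp != base) > k:
        return False
    # Each coordinate varies from the base no more than x does, on the same side.
    same_side = (pp - base) * (x - base) >= 0
    smaller = np.abs(pp - base) <= np.abs(x - base)
    # The classifier must still assign pp to the class of x.
    return bool(np.all(same_side & smaller)) and predict(pp) == predict(x)


def is_pertinent_negative(x: np.ndarray, pn: np.ndarray, base: np.ndarray,
                          predict: Callable[[np.ndarray], int], k: int) -> bool:
    """Check the verbal PN conditions for a candidate pn of input x."""
    changed = pn != x
    # Sparsity: at most k coordinates differ from x.
    if np.count_nonzero(changed) > k:
        return False
    # Changed coordinates must lie farther from the base values than x does.
    farther = np.abs(pn[changed] - base[changed]) > np.abs(x[changed] - base[changed])
    # The classifier must assign pn to a different class.
    return bool(np.all(farther)) and predict(pn) != predict(x)


def toy_predict(v: np.ndarray) -> int:
    """Hypothetical black box: class 1 when the two features sum past 1."""
    return int(v[0] + v[1] > 1.0)


# Median base values, the default mentioned in the text.
X_train = np.array([[0.0, 0.0], [0.2, -0.1], [-0.2, 0.1]])
base = np.median(X_train, axis=0)  # [0.0, 0.0]
print(is_pertinent_positive(np.array([1.5, 0.5]), np.array([1.5, 0.0]), base, toy_predict, k=1))  # True
print(is_pertinent_negative(np.array([0.5, 0.2]), np.array([0.9, 0.2]), base, toy_predict, k=1))  # True
```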
Remark: Although this method does not actually solve constrained optimization problems defining PPs and PNs, but instead uses regularization such as an elastic net penalty to impose sparsity, we assume for simplicity of exposition that our PPs and PNs are the solutions of such optimizations. The only practical difference is that the sparsity level cannot be fixed in advance, although in practice it is typically constant across many training samples.
3.2 Generating Sparse Boolean Clauses from Pertinent Positives and Pertinent Negatives
The following is the key observation of our work that lets us mine interpretable Boolean features.
Key Idea: For a given training point $\mathbf{x}^i$, we observe that a pertinent positive and a pertinent negative give rise to a sparse Boolean AND clause as follows.
A pertinent positive simply says that feature $j$ must vary from its base value at least as much as the pertinent positive value $x^i_{\mathrm{pp},j}$ does for the input to be classified into that class. Similarly, a pertinent negative says that feature $j$ can vary more than $x^i_j$, but not beyond $x^i_{\mathrm{pn},j}$.
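Therefore, one can write the following Boolean clause as a product of indicators, shown here for features whose values lie above their base values (for features below their base values, the inequalities are reversed):

$$C^i(\mathbf{x}) = \prod_{j \in \mathcal{P}^i} \mathbb{1}\left[x_j \ge x^i_{\mathrm{pp},j}\right] \prod_{j \in \mathcal{N}^i} \mathbb{1}\left[x_j < x^i_{\mathrm{pn},j}\right],$$

where $\mathcal{P}^i = \{j : x^i_{\mathrm{pp},j} \ne b_j\}$ and $\mathcal{N}^i = \{j : x^i_{\mathrm{pn},j} \ne x^i_j\}$ are the features involved in the pertinent positive and the pertinent negative, respectively.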
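The mapping from one explanation to a clause can be sketched as below. This is a minimal illustration under a stated assumption: every involved feature lies above its base value, so a PP literal becomes a lower bound and a PN literal an upper bound; the dictionary-of-intervals format and the function names are hypothetical.

```python
import numpy as np


def clause_from_explanation(x: np.ndarray, pp: np.ndarray, pn: np.ndarray,
                            base: np.ndarray) -> dict:
    """Map one (PP, PN) pair to interval literals {j: (low, high)}, read as low <= x_j < high.

    Assumes every involved feature lies above its base value (b_j <= pp_j <= x_j < pn_j).
    """
    clause = {}
    # PP literal: feature j must vary from the base at least as much as the PP does.
    for j in np.flatnonzero(pp != base):
        clause[int(j)] = (float(pp[j]), np.inf)
    # PN literal: feature j may exceed x_j but must stay below the PN value.
    for j in np.flatnonzero(pn != x):
        low, high = clause.get(int(j), (-np.inf, np.inf))
        clause[int(j)] = (low, min(high, float(pn[j])))
    return clause


def clause_holds(clause: dict, z: np.ndarray) -> bool:
    """Product of indicators: 1 only if every literal is satisfied."""
    return all(low <= z[j] < high for j, (low, high) in clause.items())


x = np.array([2.0, 1.0, 0.0])
base = np.zeros(3)
pp = np.array([1.5, 0.0, 0.0])   # feature 0 alone keeps the class
pn = np.array([2.0, 3.0, 0.0])   # raising feature 1 to 3.0 flips the class
clause = clause_from_explanation(x, pp, pn, base)
print(clause)                    # {0: (1.5, inf), 1: (-inf, 3.0)}
print(clause_holds(clause, x))   # True: the clause covers its own training point
```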
Bounds using grid points to regularize: In practice, these explanations are very local, so adding further bounds is expected to help generalization. For instance, if for some training example $\mathbf{x}^i$ the condition is $x_j \ge x^i_{\mathrm{pp},j}$, then points from other classes can also satisfy this inequality, which describes an unbounded interval on the real line. Since these clauses are derived from local perturbations (because of the penalty in the optimization), they may not be valid very far from $\mathbf{x}^i$.
We first need to reduce the number of distinct clauses, so GBFL only uses clauses involving grid points, where every coordinate has at most $g$ grid points. We call the matrix whose entry $G_{j,t}$ is the $t$-th grid point of feature $j$ the Grid Matrix $G$.
We round all the clauses to suitable nearby grid points and also add regularizing upper bounds using grid points that are far from the grid point involved in the clause. We describe this in detail below for two of the four cases that arise.
For some feature $j$, suppose the pertinent positive is such that $b_j \le x^i_{\mathrm{pp},j} \le x^i_j$, where $b_j$ is the base value of the feature. We find two grid points as follows: a) $G_{j,t_1} \le x^i_{\mathrm{pp},j}$, closest to $x^i_{\mathrm{pp},j}$, and b) $G_{j,t_2}$, closest to $x^i_j$. Here, $t_1$ and $t_2$ are their indices when the grid points are sorted from lowest to highest. Then, instead of the clause $\mathbb{1}[x_j \ge x^i_{\mathrm{pp},j}]$, we use a conjunction of two clauses, $\mathbb{1}[x_j \ge G_{j,t_1}]\,\mathbb{1}[x_j < G_{j,t_2+s}]$, where $s$ is a skip parameter that we optimize over during cross-validation. For a pertinent negative of a feature satisfying $b_j \le x^i_j < x^i_{\mathrm{pn},j}$, we find two grid points: $G_{j,t_1}$ is the closest to $x^i_j$ such that $G_{j,t_1} \le x^i_j$, and $G_{j,t_2}$ is the closest to $x^i_{\mathrm{pn},j}$. Then, instead of the clause $\mathbb{1}[x_j < x^i_{\mathrm{pn},j}]$, we substitute the conjunction $\mathbb{1}[x_j \ge G_{j,t_1}]\,\mathbb{1}[x_j < G_{j,t_2}]$. The two remaining cases, a pertinent positive below the base value and a pertinent negative for a feature with $x^i_j < b_j$, are rounded similarly. Algorithm 1 states our boolean clause generation algorithm, incorporating all these ideas for the various relative orderings of base values, pertinent positives, pertinent negatives and feature values.
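To make the rounding concrete, here is a minimal sketch of one reading of the two cases described above, taking the grid for feature $j$ as a 1-D array. The helper names are hypothetical, and two details are our own assumptions: the lower grid point is clamped to the first grid point if the value falls below the grid, and the upper bound becomes unbounded when $t_2 + s$ runs past the last grid point.

```python
import numpy as np


def round_pp_condition(grid: np.ndarray, pp_j: float, x_j: float, skip: int):
    """Assumed rounding when b_j <= pp_j <= x_j; returns (low, high) for low <= x_j < high."""
    grid = np.sort(grid)
    # t1: largest grid point not exceeding pp_j (clamped to the first grid point).
    t1 = max(int(np.searchsorted(grid, pp_j, side="right")) - 1, 0)
    # t2: grid point closest to the observed feature value x_j.
    t2 = int(np.argmin(np.abs(grid - x_j)))
    # Regularizing upper bound `skip` grid points past t2; unbounded if off the grid.
    high = float(grid[t2 + skip]) if t2 + skip < len(grid) else np.inf
    return float(grid[t1]), high


def round_pn_condition(grid: np.ndarray, x_j: float, pn_j: float):
    """Assumed rounding when b_j <= x_j < pn_j; returns (low, high)."""
    grid = np.sort(grid)
    # t1: largest grid point not exceeding x_j (clamped to the first grid point).
    t1 = max(int(np.searchsorted(grid, x_j, side="right")) - 1, 0)
    # t2: grid point closest to the pertinent negative value pn_j.
    t2 = int(np.argmin(np.abs(grid - pn_j)))
    return float(grid[t1]), float(grid[t2])


grid = np.arange(6.0)  # grid points 0, 1, ..., 5 for one feature
print(round_pp_condition(grid, pp_j=1.4, x_j=2.2, skip=2))  # (1.0, 4.0)
print(round_pn_condition(grid, x_j=2.2, pn_j=3.6))          # (2.0, 4.0)
```

A larger skip loosens the regularizing upper bound, which is why it is treated as a hyperparameter.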
KDE Binning: The binning technique used to determine the grid points is the most essential part of the algorithm. We would like to place the grid points so that the intervals are equally spaced in probability, i.e., every interval between consecutive grid points has equal probability mass. Consider the $j$-th feature $x_j$, and suppose $l_j$ and $u_j$ are lower and upper bounds for this feature. We estimate the marginal density of this feature with a kernel density estimate (KDE), using an appropriate kernel and bandwidth on the values this feature takes in the data, and obtain the corresponding cumulative distribution function. Suppose $g$ is the number of grid points we desire. Using root-finding techniques, we then find the quantiles of this distribution at $g$ equally spaced probability levels; these quantiles are the grid points $G_{j,1}, \dots, G_{j,g}$. This grid generation procedure is given in Algorithm 2.
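A minimal sketch of this grid generation for a single feature follows, using SciPy's `gaussian_kde` and the `brentq` root finder. Several choices are our assumptions rather than details taken from Algorithm 2: a Gaussian kernel with SciPy's default bandwidth rule, the KDE restricted to $[l_j, u_j]$ and renormalized, and probability levels $t/(g+1)$ for $t = 1, \dots, g$.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import gaussian_kde


def kde_grid(values: np.ndarray, low: float, high: float, num_points: int,
             bw_method=None) -> np.ndarray:
    """Equal-probability grid points for one feature from a Gaussian KDE.

    Assumptions (for illustration): Gaussian kernel, SciPy's default bandwidth
    unless bw_method is given, KDE truncated to [low, high] and renormalized,
    and quantile levels t / (num_points + 1) for t = 1..num_points.
    """
    kde = gaussian_kde(values, bw_method=bw_method)
    total = kde.integrate_box_1d(low, high)

    def cdf(t: float) -> float:
        # CDF of the KDE truncated to [low, high].
        return kde.integrate_box_1d(low, t) / total

    levels = np.arange(1, num_points + 1) / (num_points + 1)
    # Root finding: solve cdf(t) = q on [low, high] for each level q.
    return np.array([brentq(lambda t, q=q: cdf(t) - q, low, high) for q in levels])


# Usage: a skewed feature gets denser grid points where its mass concentrates.
values = np.random.default_rng(0).exponential(size=500)
print(kde_grid(values, 0.0, float(values.max()), num_points=4))
```

Stacking the per-feature outputs row by row gives the Grid Matrix.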
Learning Algorithm: We assume that a base learner $\mathcal{A}$, such as a decision tree learner or a logistic regression learner, is given. Algorithm 3 then learns a transparent model based on the boolean rules/features extracted using GBFL.
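The learning step can be sketched with scikit-learn: each clause becomes one 0/1 column, and the base learner is fit on those columns instead of the raw features. The clause format and function names are hypothetical; L1-penalized logistic regression is used as the default, matching one of the base learners in the text.

```python
from typing import Dict, List, Tuple

import numpy as np
from sklearn.linear_model import LogisticRegression

Clause = Dict[int, Tuple[float, float]]  # feature index -> (low, high), low <= x_j < high


def boolean_feature_matrix(X: np.ndarray, clauses: List[Clause]) -> np.ndarray:
    """One 0/1 column per clause: 1 where the sample satisfies every literal."""
    columns = []
    for clause in clauses:
        column = np.ones(len(X), dtype=bool)
        for j, (low, high) in clause.items():
            column &= (X[:, j] >= low) & (X[:, j] < high)
        columns.append(column)
    return np.column_stack(columns).astype(float)


def fit_transparent_model(X: np.ndarray, y: np.ndarray, clauses: List[Clause],
                          base_learner=None):
    """Train a base learner on the boolean features instead of the raw ones."""
    if base_learner is None:
        # Sparse linear base learner: L1-penalized logistic regression.
        base_learner = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    base_learner.fit(boolean_feature_matrix(X, clauses), y)
    return base_learner
```

Test inputs must pass through the same transformation, e.g. `model.predict(boolean_feature_matrix(X_test, clauses))`. A small tree such as `DecisionTreeClassifier(max_depth=5)` can be passed as `base_learner` instead. The L1 penalty is what lets the model end up using only a small fraction of the clauses.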
We now empirically validate our method. We first describe the setup, followed by a discussion of the experimental results. We provide quantitative results as well as present the most important boolean features picked by the base learners based on our construction.
We experimented on five publicly available datasets from Kaggle and the UCI repository, namely Sky Survey, Credit Card, WDBC, Higgs Boson and Waveform. The dataset characteristics are given in Table 1. The Sky Survey dataset has three classes; we also performed binary classification tasks on pairs of classes, as those were also deemed interesting in the Kaggle competition this data was part of. The Higgs Boson dataset has 250 thousand training points; we randomly subsampled 20 thousand points and trained the classifiers on them.
A random forest with 100 trees (RF) was taken as the black-box classifier, and logistic regression (with L1 penalty) trained on the original features as the base learner, for the Sky Survey and Credit Card datasets. A four-layer deep neural network (DNN) with fully connected layers (100, 50, 10, softmax) was chosen as the black-box model for the WDBC, Higgs Boson and Waveform datasets, with a decision tree (height ≤ 5) as the base learner. This shows the wide applicability of our approach to different black-box models, as well as to base learners that may or may not be differentiable. We report results computed over 10 randomizations with a 75%/25% train/test split, with statistical significance assessed by a paired t-test. 10-fold cross-validation is used to find all parameters.
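The evaluation protocol can be sketched as follows: repeat a random 75/25 split several times, score two competing pipelines on each split, and compare the paired scores with `scipy.stats.ttest_rel`. The function name and the evaluator signature are hypothetical, and hyperparameter tuning by cross-validation is omitted for brevity.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.model_selection import train_test_split


def paired_split_comparison(X, y, evaluate_a, evaluate_b, n_runs=10, test_size=0.25, seed=0):
    """Mean test scores of two pipelines over random splits, plus a paired t-test p-value.

    evaluate_a / evaluate_b: callables (X_tr, y_tr, X_te, y_te) -> test score.
    """
    scores_a, scores_b = [], []
    for run in range(n_runs):
        # The same split is used for both pipelines, which makes the scores paired.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed + run)
        scores_a.append(evaluate_a(X_tr, y_tr, X_te, y_te))
        scores_b.append(evaluate_b(X_tr, y_tr, X_te, y_te))
    p_value = ttest_rel(scores_a, scores_b).pvalue
    return float(np.mean(scores_a)), float(np.mean(scores_b)), float(p_value)
```

For example, `evaluate_a` could train the base learner on the original features and `evaluate_b` the same learner on boolean features built from the training split's explanations.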
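|Dataset|Black-box model|Black-box performance|Base learner|Base learner (original features)|Base learner (GBFL features)|
|---|---|---|---|---|---|
|Sky Survey (Star vs Galaxy)|RF| |Logistic Regression (L1)| | |
|Sky Survey (Galaxy vs Quasar)|RF| |Logistic Regression (L1)| | |
|Sky Survey (Quasar vs Galaxy)|RF| |Logistic Regression (L1)| | |
|Higgs Boson|DNN|0.70|Decision Tree|0.63|0.68|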
4.2 Quantitative Evaluation and Implications
We see a few noticeable trends in Table 2. Firstly, our method actually improves the performance of the base learner in many cases. This means that the (sparse) boolean features constructed from the local contrastive explanations carry valuable information about the prediction task over and above what the dataset offers. This could have interesting implications from at least two perspectives: 1) building accurate interpretable/transparent models that are robust, and 2) privacy, where one may not want to reveal too much information about one's model. From the first perspective, our approach provides an avenue to leverage black-box models and their local explanations to build a transparent model that could be deployed in high-stakes decision making. The models are also likely to be robust, as they are based on boolean features that are non-differentiable. From the second perspective, not only are we partly replicating the black-box model's performance, but since our model is transparent and based on the black-box model's local explanations, we might be revealing intricate details of its functioning that even a human user may be able to understand and replicate. This may not be acceptable to the model owner.
Secondly, we see that, in general, the bigger the gap in performance between the black-box model and the base learner trained on the original dataset, the larger the relative improvement. This is not too surprising: if the local explanations are faithful to the black-box model, they should contain rich information that cannot be readily extracted from the original dataset, at least by simple base learners.
Thirdly, the good generalization shown by GBFL implies that the contrastive explanations themselves capture relevant information; hence, GBFL's performance could be a testament to the overall quality of the explanations in characterizing the model. It could thus serve as a (global) quantitative metric for evaluating such explanations, since locally PPs always lie in the predicted class and PNs never do.
4.3 Qualitative Evaluation
We now show the boolean features, constructed by our method from the contrastive explanations of the respective black-box models, that the base learner deems most important for some of the datasets. Although the boolean features are composed of multiple input/original features, the resulting model is still transparent and reasonably easy to parse, and certainly more so than the original black-box model.
Sky Survey Dataset. In Listing 1, we see the top 4 boolean features according to L1-Logistic, which is the base learner. We observe that nine of the 17 original features were selected by the contrastive explanations (taking the union of the PP and PN features). We then constructed boolean features out of them using Algorithm 1. The different boolean features thus have conditions, i.e., upper and lower bounds, on the same set of original features. Some of the original features, such as redshift, mjd and dec, have the same condition across multiple boolean features, indicating that these features, along with their ranges, are likely to be the most important. On the other hand, ra has different ranges in most cases. The other features fall in between, with ranges that repeat across some boolean features but not others. Since each boolean feature is composed of multiple input features, the formulas may look complicated. However, they are still formulas that can be parsed, and the final decision for an input can be traced by following the conditions and noting which boolean features are satisfied.
WDBC Dataset. In Listing 2, we see the top 3 boolean features when a small decision tree is the base learner. Again, boolean features were constructed by parsing PPs and PNs for the training points. More original input features are selected for this dataset than for Sky Survey. Nonetheless, since they are still just boolean formulas, the model is transparent, and in this case too the decision for a data point can be traced by following the conditions. Here too, features such as n1_concavepts, n0_fractald, n1_concavity and n2_symmetry are repeated in all the boolean features with the same range of values, which could indicate that these input features, along with their ranges, are important in the decision making. Other features either do not repeat or repeat with different conditions/ranges.
As systems get more complicated (e.g., neural networks and boosted trees), replicating their performance using simple interpretable models might become increasingly challenging. Transparent models could be the answer here, as they leave more leeway to build complicated models whose decisions can still be traced, making them auditable. Auditability is extremely important in domains such as finance, where decisions need to be traceable and proxy models that explain black boxes are not really acceptable. Moreover, transparent boolean classifiers have the added advantage of efficiency, since even large boolean formulas can potentially be made extremely scalable by implementing them in hardware.
That models built from local explanations generalize well is, in some sense, a true testament to the fidelity of the explanations and the useful information they possess. Hence, the accuracy of models built on our boolean features could provide a global view of the quality of these local explanations. On the flip side, this could raise privacy concerns: such a model not only (mostly) replicates the performance of a proprietary black-box model, but also makes its decisions transparent to a human, who could gain unwanted insight into its functioning.
In the future, we would like to build other classifiers (viz. weighted rule sets) using our boolean features. We would also like to study the theoretical reasons behind the good generalization provided by our method. We conjecture that this is connected to the stability results shown for stochastic gradient descent in deep learning settings, given that the local explanation method primarily relies on gradient descent.
-  Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
-  Osbert Bastani, Carolyn Kim, and Hamsa Bastani. Interpreting blackbox models via model extraction. arXiv preprint arXiv:1705.08504, 2017.
-  Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.
-  Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pages 1721–1730, New York, NY, USA, 2015. ACM.
-  Amit Dhurandhar, Pin-Yu Chen, Ronny Luss, Chun-Chen Tu, Paishun Ting, Karthikeyan Shanmugam, and Payel Das. Explanations based on the missing: Towards contrastive explanations with pertinent negatives. In Advances in Neural Information Processing Systems 31. 2018.
-  Amit Dhurandhar, Tejaswini Pedapati, Avinash Balakrishnan, PinYu Chen, Karthikeyan Shanmugam, and Ruchir Puri. Model agnostic contrastive explanations for structured data. arXiv preprint, 2019.
-  Amit Dhurandhar, Karthikeyan Shanmugam, Ronny Luss, and Peder Olsen. Improving simple models with confidence profiles. Advances of Neural Inf. Processing Systems (NeurIPS), 2018.
-  Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
-  I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
-  Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, and Trevor Darrell. Generating visual explanations. In European Conference on Computer Vision, 2016.
-  Tsuyoshi Idé and Amit Dhurandhar. Supervised item response models for informative prediction. Knowl. Inf. Syst., 51(1):235–257, April 2017.
-  Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in neural information processing systems, pages 4765–4774, 2017.
-  Ramaravind Kommiya Mothilal, Amit Sharma, and Chenhao Tan. Explaining machine learning classifiers through diverse counterfactual explanations. arXiv preprint arXiv:1905.07697, 2019.
-  Marco Ribeiro, Sameer Singh, and Carlos Guestrin. “Why should I trust you?”: Explaining the predictions of any classifier. In ACM SIGKDD Intl. Conference on Knowledge Discovery and Data Mining, 2016.
-  Please stop explaining black box models for high stakes decisions. NIPS Workshop on Critiquing and Correcting Trends in Machine Learning, 2018.
-  Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019.
-  Scott Lundberg and Su-In Lee. Unified framework for interpretable methods. In Advances of Neural Inf. Proc. Systems, 2017.
-  Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. arXiv preprint arXiv:1610.02391v3, 2016.
-  Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
-  Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.
-  Introduction to the Theory of Computation, 3rd edition. Cengage Learning, 2013.
-  SkyServer. Kaggle, 2018.
-  Guolong Su, Dennis Wei, Kush Varshney, and Dmitry Malioutov. Interpretable two-level boolean rule learning for classification. arXiv preprint arXiv:1606.05798, 2016.
-  Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR, 2017.