The Intriguing Properties of Model Explanations

01/30/2018 ∙ by Maruan Al-Shedivat, et al. ∙ Carnegie Mellon University

Linear approximations to the decision boundary of a complex model have become one of the most popular tools for interpreting predictions. In this paper, we study such linear explanations produced either post-hoc by a few recent methods or generated along with predictions with contextual explanation networks (CENs). We focus on two questions: (i) whether linear explanations are always consistent or can be misleading, and (ii) when integrated into the prediction process, whether and how explanations affect the performance of the model. Our analysis sheds more light on certain properties of explanations produced by different methods and suggests that learning models that explain and predict jointly is often advantageous.




1 Introduction

Model interpretability is a long-standing problem in machine learning that has become quite acute with the accelerating pace of widespread adoption of complex predictive algorithms. There are multiple approaches to interpreting models and their predictions, ranging from a variety of visualization techniques (Simonyan et al., 2013; Yosinski et al., 2015; Mahendran and Vedaldi, 2015) to explanations by example (Caruana et al., 1999; Kim et al., 2014). The approach that we consider in this paper treats explanations as models themselves that approximate the decision boundary of the original predictor but belong to a significantly simpler class (e.g., local linear approximations).

Explanations can be generated either post-hoc or alongside predictions. A popular method, called LIME (Ribeiro et al., 2016), takes the first approach and attempts to explain predictions of an arbitrary model by searching for linear local approximations of the decision boundary. On the other hand, recently proposed contextual explanation networks (CENs) (Al-Shedivat et al., 2017) incorporate a similar mechanism directly into deep neural networks of arbitrary architecture and learn to predict and to explain jointly. Here, we focus on analyzing a few properties of the explanations generated by LIME, its variations, and CEN. In particular, we seek answers to the following questions:

  1. Explanations are only as good as the features they use to explain predictions. We ask whether and how feature selection and feature noise affect the consistency of explanations.

  2. When explanation is a part of the learning and prediction process, how does that affect the performance of the predictive model?

  3. Finally, what kind of insight can we gain by visualizing and inspecting explanations?

2 Methods

We start with a brief overview of the methods compared in this paper: LIME (Ribeiro et al., 2016) and CENs (Al-Shedivat et al., 2017). Given a dataset of inputs, x ∈ X, and targets, y ∈ Y, our goal is to learn a predictive model, f : X → Y. To explain each prediction, we have access to another set of features, z ∈ Z, and construct explanations, g : Z → Y, such that they are consistent with the original model, g(z) ≈ f(x). These additional features, z, are assumed to be more interpretable than x, and are called the interpretable representation in Ribeiro et al. (2016) and attributes in (Al-Shedivat et al., 2017).

2.1 LIME and Variations

Given a trained model, f, and an instance with features (x, z), LIME constructs an explanation, g_x, as follows:

    g_x = argmin_{g ∈ G} L(f, g, π_x) + Ω(g)    (1)

where L is the loss that measures how well g approximates f in the neighborhood defined by the similarity kernel, π_x, in the space of additional features, Z, and Ω(g) is the penalty on the complexity of the explanation. Now more specifically, Ribeiro et al. (2016) assume that G is the class of linear models:

    g(z) = w_g^⊤ z    (2)

and define the loss and the similarity kernel as follows:

    L(f, g, π_x) = Σ_k π_x(z_k) (f(x_k) − g(z_k))²,    π_x(z_k) = exp{−D(z, z_k)² / σ²}    (3)

where the data instance is represented by z, the corresponding z_k are the perturbed features, D is some distance function, and σ is the scale parameter of the kernel. Ω(g) is further chosen to favor sparsity of explanations.
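The procedure above can be sketched in a few lines of NumPy. This is an illustrative simplification, not the actual `lime` package: the real method perturbs the raw input x rather than z directly, uses configurable distance functions, and adds the sparsity penalty Ω(g), all of which are omitted here.

```python
import numpy as np

def lime_explain(f, z, num_samples=500, sigma=1.0, seed=0):
    """Fit a local weighted linear approximation to a black-box model f
    around the instance with interpretable features z (a LIME-style sketch).
    """
    rng = np.random.default_rng(seed)
    d = z.shape[0]
    # Perturbed samples z_k: randomly mask coordinates of z.
    Z = rng.integers(0, 2, size=(num_samples, d)) * z
    y = np.array([f(zk) for zk in Z])          # black-box outputs f(z_k)
    # Similarity kernel pi_x(z_k) = exp(-D(z, z_k)^2 / sigma^2), D Euclidean.
    pi = np.exp(-np.sum((Z - z) ** 2, axis=1) / sigma ** 2)
    # Weighted least squares with an intercept column.
    X = np.hstack([np.ones((num_samples, 1)), Z])
    sw = np.sqrt(pi)
    w, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return w  # w[0]: intercept, w[1:]: per-feature explanation weights
```

Because the fit is a plain weighted least-squares problem, a model that is globally linear is recovered exactly; for a nonlinear f, the kernel π_x localizes the fit to the neighborhood of z.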

2.2 Contextual Explanation Networks

LIME is a post-hoc model explanation method. This means that it justifies model predictions by producing explanations which, while locally correct, are never used to make the predictions in the first place. Contrary to that, CENs use explanations as an integral part of the learning process and make predictions by applying the generated explanations. More formally, CENs construct the predictive model via a composition: given x, an encoder, φ, produces an explanation, g = φ(x), which is further applied to z to make a prediction. In other words:

    ŷ = g(z),  where g = φ(x)    (4)
In (Al-Shedivat et al., 2017) we introduced a more general probabilistic framework that allows one to combine different deterministic and probabilistic encoders with explanations represented by arbitrary graphical models. To keep our discussion simple and concrete, here we assume that explanations take the same linear form (2) as for LIME and the encoder maps x to the weights w_g as follows:

    w_g = Σ_{k=1}^K α_k d_k,    α = softmax(h(x)),    d_k ∈ D    (5)

In other words, the explanation is constrained to be a convex combination of components from a global learnable dictionary, D = {d_1, …, d_K}, where the combination weights, α, also called attention, are produced by a deep network, h. An encoder of this form is called a constrained deterministic map in (Al-Shedivat et al., 2017), and the model is trained jointly w.r.t. the dictionary and the encoder parameters to minimize the prediction error.
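The forward pass of such a constrained encoder is easy to express directly. The sketch below is a minimal non-learnable version: the feature extractor `h` and the dictionary are assumed to be given (in a real CEN both are trained jointly by backpropagation), and a logistic link is assumed for binary prediction.

```python
import numpy as np

def softmax(a):
    # Numerically stable softmax over attention logits.
    e = np.exp(a - a.max())
    return e / e.sum()

class ContextualLinearExplainer:
    """Sketch of a CEN-style constrained deterministic encoder:
    the explanation is a convex combination of dictionary atoms,
    with attention produced from the context x."""

    def __init__(self, h, dictionary):
        self.h = h         # maps context x -> K attention logits (e.g., a deep net)
        self.D = dictionary  # (K, d) dictionary of linear explanations

    def explain(self, x):
        alpha = softmax(self.h(x))   # attention over the K atoms
        return alpha @ self.D        # w_g = sum_k alpha_k * d_k

    def predict(self, x, z):
        # Apply the generated linear explanation to the interpretable features.
        w = self.explain(x)
        return 1.0 / (1.0 + np.exp(-(w @ z)))
```

Note that the same object both predicts and explains: inspecting `explain(x)` for a given input yields exactly the linear model that produced the prediction, rather than an after-the-fact approximation.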

3 Analysis

Both LIME and CEN produce explanations in the form of linear models that can be further used for prediction diagnostics. Our goal is to understand how different conditions affect explanations generated by both methods, see whether this may lead to erroneous conclusions, and finally understand how jointly learning to predict and to explain affects performance.

We use the following 3 tasks in our analysis: MNIST image classification, sentiment classification of the IMDB reviews (Maas et al., 2011), and poverty prediction for households in Uganda from satellite imagery and survey data (Jean et al., 2016). The details of the setup are omitted in the interest of space but can be found in (Al-Shedivat et al., 2017), as we follow exactly the same setup.

3.1 Consistency of Explanations

Linear explanations assign weights to the interpretable features, z, and hence strongly depend on their quality and on the way we select them. We consider two cases where (a) the features are corrupted with additive noise, and (b) the selected features are incomplete. For analysis, we use the MNIST and IMDB data.

Fig. 1: The effect of feature quality on explanations. (a) Explanation test error vs. the level of the noise added to the interpretable features. (b) Explanation test error vs. the total number of interpretable features.

We train baseline deep architectures (CNN on MNIST and LSTM on IMDB) and their CEN variants. For MNIST, z is either the pixels of a scaled-down image (pxl) or HOG features (hog). For IMDB, z is either a bag of words (bow) or a topic vector (tpc) produced by a pre-trained topic model.

The effect of noisy features. In this experiment, we inject noise (zero-mean Gaussian, with the variance selected appropriately for each signal-to-noise ratio level) into the features z and ask LIME and CEN to fit explanations to the noisy features. The predictive performance of the produced explanations on noisy features is given in Fig. 1(a). Note that after injecting noise, each data point has a noiseless representation x and a noisy z. Since the baselines take only x as inputs, their performance stays the same and, regardless of the noise level, LIME "successfully" overfits explanations: it is able to almost perfectly approximate the decision boundary of the baselines using very noisy features. On the other hand, the performance of CEN gets worse with the increasing noise level, indicating that the model fails to learn when the selected interpretable representation is of low quality.
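The noise-injection step can be sketched as follows. The exact SNR convention is not spelled out in the text, so this sketch assumes the common definition SNR = signal power / noise power, which fixes the noise variance at var(z) / SNR.

```python
import numpy as np

def add_noise_at_snr(z, snr, seed=0):
    """Corrupt features z with zero-mean Gaussian noise whose variance
    is chosen to hit a target signal-to-noise ratio (power ratio)."""
    rng = np.random.default_rng(seed)
    noise_var = z.var() / snr  # assumed convention: SNR = var(z) / var(noise)
    return z + rng.normal(0.0, np.sqrt(noise_var), size=z.shape)
```

Sweeping `snr` from high to low then traces out curves like those in Fig. 1(a): the explanation is refit on `add_noise_at_snr(z, snr)` at each level while the baseline, which never sees z, stays fixed.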

The effect of feature selection. Here, we use the same setup, but instead of injecting noise into z, we construct z by randomly subsampling a set of its dimensions. Fig. 1(b) demonstrates the result. While the performance of CENs degrades as fewer dimensions are retained, we see that, again, LIME is able to fit explanations to the decision boundary of the original models despite the loss of information.
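The incomplete-features condition amounts to a one-line masking step; a sketch, with the subset size k as the quantity swept in Fig. 1(b):

```python
import numpy as np

def subsample_features(Z, k, seed=0):
    """Keep a random subset of k interpretable feature dimensions.
    Z is an (n, d) matrix of interpretable features; the explanation
    is then fit on the reduced matrix instead of the full Z."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(Z.shape[1], size=k, replace=False)
    return Z[:, idx], idx
```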

These two experiments indicate a major drawback of explaining predictions post-hoc: when constructed on poor, noisy, or incomplete features, such explanations can overfit the decision boundary of a predictor and are likely to be misleading. For example, predictions of a perfectly valid model might end up with absurd explanations, which is unacceptable from a decision-support point of view.

3.2 Explanations as a Regularizer

In this part, we compare CENs with baselines in terms of performance. In each task, CENs are trained to simultaneously generate predictions and construct explanations. Overall, CENs show very competitive performance and are able to approach or surpass baselines in a number of cases, especially on the IMDB data (see Table 1). This suggests that forcing the model to produce explanations along with predictions does not limit its capacity.

Fig. 2:

(a) Training error vs. iteration (epoch or batch) for baselines and CENs. (b) Validation error for models trained on random subsets of data of different sizes.

Additionally, the "explanation layer" in CENs affects the geometry of the optimization problem and leads to faster and better convergence (Fig. 2(a)). Finally, we train the models on subsets of data (with the size varied from 1% to 20% for MNIST and from 2% to 40% for IMDB) and notice that explanations play the role of a regularizer which strongly improves the sample complexity (Fig. 2(b)).

MNIST IMDB Satellite
Model Err (%) Model Err (%) Model Acc (%) AUC (%)

Best previous results for similar LSTMs: (supervised) and (semi-supervised) Johnson and Zhang (2016).

Table 1: Performance of the models on classification tasks (averaged over 5 runs; the standard deviations are on the order of the least significant digit). The subscripts denote the features on which the linear models are built: pixels (pxl), HOG (hog), bag-of-words (bow), topics (tpc), embeddings (emb), discrete attributes (att).

3.3 Visualizing Explanations

Finally, we showcase the insights one can get from explanations produced along with predictions. Particularly, we consider the problem of poverty prediction for household clusters in Uganda from satellite imagery and survey data. Each household cluster is represented by a collection of satellite images x; z is a vector of 65 categorical features from the living standards measurement survey (LSMS). The goal is binary classification of households in Uganda into poor and not poor. In our methodology, we closely follow the original study of Jean et al. (2016) and use a pretrained network for embedding the images into a 4096-dimensional space, on top of which we build our contextual models. Note that this dataset is fairly small (642 points), and hence we keep the pretrained network frozen to avoid overfitting. We note that, quantitatively, by conditioning on the VGG features of the satellite imagery, CENs are able to significantly improve upon sparse linear models built on the survey features only (known as the gold standard in remote sensing techniques).

After training CEN with a dictionary of size 32, we discover that the encoder tends to sharply select one of two explanations (M1 and M2) for different household clusters in Uganda (see Fig. 3(a) and also Fig. 4(a) in the appendix). In the survey data, each household cluster is marked as either urban or rural; we notice that, conditional on a satellite image, CEN tends to pick M1 for urban areas and M2 for rural ones (Fig. 3(b)). Notice that the explanations weigh different categorical features, such as the reliability of the water source or the proportion of houses with walls made of unburnt brick, quite differently. When visualized on the map, we see that CEN selects M1 more frequently around the major city areas, which also correlates with high nightlight intensity in those areas (Fig. 3(c), 3(d)). The high performance of the model makes us confident in the produced explanations (contrary to LIME, as discussed in Sec. 3.1) and allows us to draw conclusions about what causes the model to classify certain households in different neighborhoods as poor.

Fig. 3: Qualitative results for the Satellite dataset: (a) Weights given to a subset of features by the two models (M1 and M2) discovered by CEN. (b) How frequently M1 and M2 are selected for areas marked rural or urban (top) and the average proportion of Tenement-type households in an urban/rural area for which M1 or M2 was selected. (c) M1 and M2 models selected for different areas on the Uganda map. M1 tends to be selected for more urbanized areas while M2 is picked for the rest. (d) Nightlight intensity of different areas of Uganda.


Appendix A Appendix

(a) Full visualization of explanations M1 and M2 learned by CEN on the poverty prediction task.
(b) Correlation between the selected explanation and the value of a particular survey variable.
Fig. 4: Additional visualizations for the poverty prediction task.

Appendix B Details on Consistency of Explanations

We provide a detailed description of the experimental setup used for our analysis in Section 3.1.