Model interpretability is a long-standing problem in machine learning that has become quite acute with the accelerating pace of adoption of complex predictive algorithms. There are multiple approaches to interpreting models and their predictions, ranging from a variety of visualization techniques (Simonyan et al., 2013; Yosinski et al., 2015; Mahendran and Vedaldi, 2015) to explanations by example (Caruana et al., 1999; Kim et al., 2014). The approach that we consider in this paper treats explanations as models themselves that approximate the decision boundary of the original predictor but belong to a significantly simpler class (e.g., local linear approximations).
Explanations can be generated either post-hoc or alongside predictions. A popular method, called LIME (Ribeiro et al., 2016), takes the first approach and attempts to explain predictions of an arbitrary model by searching for linear local approximations of the decision boundary. On the other hand, recently proposed contextual explanation networks (CENs) (Al-Shedivat et al., 2017)
incorporate a similar mechanism directly into deep neural networks of arbitrary architecture and learn to predict and to explain jointly. Here, we focus on analyzing a few properties of the explanations generated by LIME, its variations, and CEN. In particular, we seek answers to the following questions:
- Explanations are only as good as the features they use to explain predictions. Do feature selection and feature noise affect the consistency of explanations, and if so, how?
- When explanation is part of the learning and prediction process, how does that affect the performance of the predictive model?
- Finally, what kind of insight can we gain by visualizing and inspecting explanations?
We start with a brief overview of the methods compared in this paper: LIME (Ribeiro et al., 2016) and CENs (Al-Shedivat et al., 2017). Given a dataset of inputs, $\mathcal{X} = \{x\}$, and targets, $\mathcal{Y} = \{y\}$, our goal is to learn a predictive model, $f : \mathcal{X} \to \mathcal{Y}$. To explain each prediction, we have access to another set of features, $\mathcal{Z} = \{z\}$, and construct explanations, $g : \mathcal{Z} \to \mathcal{Y}$, such that they are consistent with the original model, $g(z) \approx f(x)$. These additional features, $z$, are assumed to be more interpretable than $x$; they are called the interpretable representation in Ribeiro et al. (2016) and attributes in Al-Shedivat et al. (2017).
2.1 LIME and Variations
Given a trained model, $f$, and an instance with features $(x, z)$, LIME constructs an explanation, $\hat{g}$, as follows:

$$\hat{g} = \operatorname*{argmin}_{g \in \mathcal{G}} \; \mathcal{L}(f, g, \pi_x) + \Omega(g) \qquad (1)$$

where $\mathcal{L}(f, g, \pi_x)$ is the loss that measures how well $g$ approximates $f$ in the neighborhood defined by the similarity kernel, $\pi_x$, in the space of additional features, $\mathcal{Z}$, and $\Omega(g)$ is the penalty on the complexity of the explanation. More specifically, Ribeiro et al. (2016) assume that $\mathcal{G}$ is the class of linear models:

$$g(z) = w_g^\top z \qquad (2)$$

and define the loss and the similarity kernel as follows:

$$\mathcal{L}(f, g, \pi_x) = \sum_{\tilde{z}} \pi_x(\tilde{z}) \left( f(\tilde{z}) - g(\tilde{z}) \right)^2, \qquad \pi_x(\tilde{z}) = \exp\left\{ -D(z, \tilde{z})^2 / \sigma^2 \right\} \qquad (3)$$

where the data instance is represented by $z$, the corresponding $\tilde{z}$ are the perturbed features, $D$ is some distance function, and $\sigma$ is the scale parameter of the kernel. The penalty $\Omega$ is further chosen to favor sparsity of explanations.
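As a concrete illustration, the fitting procedure above can be sketched in a few lines of NumPy. This is our simplified reading, not the reference implementation: we perturb binary interpretable features, map them back by masking, and use plain weighted least squares in place of the sparse fit.

```python
import numpy as np

def lime_explain(f, x, n_samples=500, sigma=1.0, rng=None):
    """Minimal sketch of LIME's local linear fit (illustrative only)."""
    rng = rng or np.random.default_rng(0)
    d = len(x)
    # Sample binary interpretable perturbations around the instance.
    Z = rng.integers(0, 2, size=(n_samples, d)).astype(float)
    # Map each perturbation back to the original space by masking x
    # (a stand-in for the true interpretable-to-input mapping).
    X = Z * x
    preds = np.array([f(xi) for xi in X])
    # Proximity weights from the exponential kernel exp(-D^2 / sigma^2).
    dist2 = ((X - x) ** 2).sum(axis=1)
    w = np.exp(-dist2 / sigma ** 2)
    # Weighted least squares with intercept (stand-in for the sparse fit).
    sw = np.sqrt(w)[:, None]
    A = sw * np.hstack([Z, np.ones((n_samples, 1))])
    b = sw.ravel() * preds
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef[:-1]  # per-feature explanation weights
```

For a model that simply sums its inputs, the recovered local weights are, as expected, close to one for every feature.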
2.2 Contextual Explanation Networks
LIME is a post-hoc model explanation method: it justifies model predictions by producing explanations which, while locally correct, are never used to make the predictions in the first place. Contrary to that, CENs use explanations as an integral part of the learning process and make predictions by applying the generated explanations. More formally, CENs construct the predictive model via a composition: given $x$, an encoder, $\phi$, produces an explanation, $g = \phi(x)$, which is further applied to $z$ to make a prediction. In other words:

$$\hat{y} = g(z), \qquad g = \phi(x)$$
In Al-Shedivat et al. (2017), we introduced a more general probabilistic framework that allows one to combine different deterministic and probabilistic encoders with explanations represented by arbitrary graphical models. To keep our discussion simple and concrete, here we assume that explanations take the same linear form (2) as for LIME and that the encoder maps $x$ to the linear weights $w$ as follows:

$$w = \sum_{k=1}^{K} \alpha_k w_k, \qquad \alpha = \operatorname{softmax}\left(h_\theta(x)\right)$$
In other words, the explanation is constrained to be a convex combination of components from a global learnable dictionary, $D = \{w_1, \dots, w_K\}$, where the combination weights, $\alpha$, also called attention, are produced by a deep network, $h_\theta$. An encoder of this form is called a constrained deterministic map in Al-Shedivat et al. (2017), and the model is trained jointly w.r.t. the network parameters $\theta$ and the dictionary $D$ to minimize the prediction error.
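The constrained deterministic map can be sketched as a small prediction head. The class and argument names below are ours, and `encode` stands in for the deep network producing attention logits; this is a minimal illustration, not the reference implementation.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

class CENHead:
    """Sketch of a CEN explanation layer: attention over a global dictionary."""

    def __init__(self, dictionary, encode):
        self.D = np.asarray(dictionary)  # (K, d) dictionary of linear explanations
        self.encode = encode             # callable: context x -> K attention logits

    def explain(self, x):
        alpha = softmax(self.encode(x))  # convex combination weights (attention)
        return alpha @ self.D            # instance-specific linear explanation w

    def predict_proba(self, x, z):
        w = self.explain(x)
        return 1.0 / (1.0 + np.exp(-(w @ z)))  # logistic prediction from w^T z
```

With sharply peaked attention logits, the produced explanation collapses onto a single dictionary atom, which mirrors the sharp selection behavior discussed later for the poverty data.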
Both LIME and CEN produce explanations in the form of linear models that can be further used for prediction diagnostics. Our goal is to understand how different conditions affect explanations generated by both methods, see whether this may lead to erroneous conclusions, and finally understand how jointly learning to predict and to explain affects performance.
We use the following three tasks in our analysis: MNIST image classification (http://yann.lecun.com/exdb/mnist/), sentiment classification of the IMDB reviews (Maas et al., 2011), and poverty prediction for households in Uganda from satellite imagery and survey data (Jean et al., 2016). The details of the setup are omitted in the interest of space but can be found in Al-Shedivat et al. (2017), as we follow exactly the same setup.
3.1 Consistency of Explanations
Linear explanations assign weights to the interpretable features, $z$, and hence strongly depend on their quality and on the way we select them. We consider two cases, in which (a) the features are corrupted with additive noise, and (b) the selected features are incomplete. For this analysis, we use the MNIST and IMDB data.
We train baseline deep architectures (CNN on MNIST and LSTM on IMDB) and their CEN variants. For MNIST, $z$ is either the pixels of a scaled-down image (pxl) or HOG features (hog). For IMDB, $z$ is either a bag of words (bow) or a topic vector (tpc) produced by a pre-trained topic model.
The effect of noisy features.
In this experiment, we inject noise into the features $z$ and ask LIME and CEN to fit explanations to the noisy features. (We use Gaussian noise with zero mean and select the variance appropriately for each signal-to-noise-ratio level.) The predictive performance of the produced explanations on noisy features is given in Fig. 0(a). Note that after injecting noise, each data point has a noiseless representation, $x$, and a noisy $z$. Since the baselines take only $x$ as input, their performance stays the same and, regardless of the noise level, LIME "successfully" overfits explanations: it is able to almost perfectly approximate the decision boundary of the baselines using very noisy features. On the other hand, the performance of CEN gets worse as the noise level increases, indicating that the model fails to learn when the selected interpretable representation is of low quality.
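Our reading of the noise-injection recipe can be sketched as follows: the Gaussian noise variance is chosen per feature array so that the ratio of signal variance to noise variance matches the desired SNR level (the exact per-level choice in the experiments may differ).

```python
import numpy as np

def add_noise_at_snr(z, snr, rng=None):
    """Corrupt features with zero-mean Gaussian noise at a target SNR.

    The noise variance is set so that Var(signal) / Var(noise) == snr.
    """
    rng = rng or np.random.default_rng(0)
    noise_std = np.sqrt(z.var() / snr)
    return z + rng.normal(0.0, noise_std, size=z.shape)
```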
The effect of feature selection. Here, we use the same setup but, instead of injecting noise into $z$, we construct $z$ by randomly subsampling a subset of its dimensions. Fig. 0(b) demonstrates the result. While the performance of CENs degrades as the subset of selected dimensions shrinks, we see that, again, LIME is able to fit explanations to the decision boundary of the original models despite the loss of information.
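The subsampling step can be sketched as a small helper (an illustrative function of ours, not the paper's code): it keeps a random fraction of the interpretable dimensions and returns the kept indices so that the same subset can be reused across splits.

```python
import numpy as np

def subsample_features(Z, keep_frac, rng=None):
    """Randomly keep a fraction of the interpretable feature dimensions."""
    rng = rng or np.random.default_rng(0)
    d = Z.shape[1]
    # Choose distinct column indices; keep at least one dimension.
    keep = np.sort(rng.choice(d, size=max(1, int(keep_frac * d)), replace=False))
    return Z[:, keep], keep
```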
These two experiments indicate a major drawback of explaining predictions post-hoc: when constructed on poor, noisy, or incomplete features, such explanations can overfit the decision boundary of a predictor and are likely to be misleading. For example, predictions of a perfectly valid model might end up with absurd explanations, which is unacceptable from a decision-support point of view.
3.2 Explanations as a Regularizer
In this part, we compare CENs with baselines in terms of performance. In each task, CENs are trained to simultaneously generate predictions and construct explanations. Overall, CENs show very competitive performance and are able to approach or surpass baselines in a number of cases, especially on the IMDB data (see Table 1). This suggests that forcing the model to produce explanations along with predictions does not limit its capacity.
(a) Training error vs. iteration (epoch or batch) for baselines and CENs. (b) Validation error for models trained on random subsets of data of different sizes.
Additionally, the “explanation layer” in CENs affects the geometry of the optimization problem and causes faster and better convergence (Fig. 1(a)). Finally, we train the models on subsets of data (the size varied from 1% to 20% for MNIST and from 2% to 40% for IMDB) and notice that explanations play the role of a regularizer which strongly improves the sample complexity (Fig. 1(b)).
Table 1: error (%), accuracy (%), and AUC (%) of the compared models on each task.
Best previous results for similar LSTMs: (supervised) and (semi-supervised) Johnson and Zhang (2016).
3.3 Visualizing Explanations
Finally, we showcase the insights one can get from explanations produced along with predictions. In particular, we consider the problem of poverty prediction for household clusters in Uganda from satellite imagery and survey data. Each household cluster is represented by a collection of satellite images, $x$; the interpretable features, $z$, are a vector of 65 categorical features from the living standards measurement survey (LSMS). The goal is binary classification of households into poor and not poor. In our methodology, we closely follow the original study of Jean et al. (2016) and use a pretrained VGG network to embed the images into a 4096-dimensional space, on top of which we build our contextual models. Note that this dataset is fairly small (642 points), and hence we keep the pretrained embedding network frozen to avoid overfitting. Quantitatively, by conditioning on the VGG features of the satellite imagery, CENs are able to significantly improve upon sparse linear models that use the survey features only (known as the gold standard in remote sensing).
After training a CEN with a dictionary of size 32, we discover that the encoder tends to sharply select one of two explanations (M1 and M2) for different household clusters in Uganda (see Fig. 2(a) and also Fig. 3(a) in the appendix). In the survey data, each household cluster is marked as either urban or rural; we notice that, conditional on a satellite image, CEN tends to pick M1 for urban areas and M2 for rural ones (Fig. 2(b)). Notice that the explanations weigh different categorical features, such as the reliability of the water source or the proportion of houses with walls made of unburnt brick, quite differently. When visualized on the map, we see that CEN selects M1 more frequently around major city areas, which also correlates with high nightlight intensity in those areas (Fig. 2(c), 2(d)). The high performance of the model makes us confident in the produced explanations (contrary to LIME, as discussed in Sec. 3.1) and allows us to draw conclusions about what causes the model to classify certain households in different neighborhoods as poor.
- Simonyan et al. (2013) Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
- Yosinski et al. (2015) Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.
- Mahendran and Vedaldi (2015) Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5188–5196, 2015.
- Caruana et al. (1999) Rich Caruana, Hooshang Kangarloo, JD Dionisio, Usha Sinha, and David Johnson. Case-based explanation of non-case-based learning methods. In Proceedings of the AMIA Symposium, page 212, 1999.
- Kim et al. (2014) Been Kim, Cynthia Rudin, and Julie A Shah. The bayesian case model: A generative approach for case-based reasoning and prototype classification. In Advances in Neural Information Processing Systems, pages 1952–1960, 2014.
- Ribeiro et al. (2016) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why Should I Trust You?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
- Al-Shedivat et al. (2017) Maruan Al-Shedivat, Avinava Dubey, and Eric P Xing. Contextual explanation networks. arXiv preprint arXiv:1705.10301, 2017.
- Maas et al. (2011) Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 142–150. Association for Computational Linguistics, 2011.
- Jean et al. (2016) Neal Jean, Marshall Burke, Michael Xie, W Matthew Davis, David B Lobell, and Stefano Ermon. Combining satellite imagery and machine learning to predict poverty. Science, 353(6301):790–794, 2016.
- Johnson and Zhang (2016) Rie Johnson and Tong Zhang. Supervised and semi-supervised text categorization using lstm for region embeddings. In Proceedings of The 33rd International Conference on Machine Learning, pages 526–534, 2016.
Appendix A Details on Consistency of Explanations
We provide a detailed description of the experimental setup used for our analysis in Section 3.1.