Explaining Latent Representations with a Corpus of Examples

by Jonathan Crabbé et al.
University of Cambridge

Modern machine learning models are complicated. Most of them rely on convoluted latent representations of their input to issue a prediction. To achieve greater transparency than a black-box that connects inputs to predictions, it is necessary to gain a deeper understanding of these latent representations. To that aim, we propose SimplEx: a user-centred method that provides example-based explanations with reference to a freely selected set of examples, called the corpus. SimplEx uses the corpus to improve the user's understanding of the latent space with post-hoc explanations answering two questions: (1) Which corpus examples explain the prediction issued for a given test example? (2) What features of these corpus examples are relevant for the model to relate them to the test example? SimplEx provides an answer by reconstructing the test latent representation as a mixture of corpus latent representations. Further, we propose a novel approach, the Integrated Jacobian, that allows SimplEx to make explicit the contribution of each corpus feature in the mixture. Through experiments on tasks ranging from mortality prediction to image classification, we demonstrate that these decompositions are robust and accurate. With illustrative use cases in medicine, we show that SimplEx empowers the user by highlighting relevant patterns in the corpus that explain model representations. Moreover, we demonstrate how the freedom in choosing the corpus allows the user to have personalized explanations in terms of examples that are meaningful for them.




1 Introduction and related work

How can we make a machine learning model convincing? If accuracy is undoubtedly necessary, it is rarely sufficient. As these models are used in critical areas such as medicine, finance and the criminal justice system, their black-box nature appears as a major issue Lipton (2016); Ching et al. (2018); Tjoa and Guan (2020). The need to address this problem has driven the development of the explainable artificial intelligence (XAI) landscape Barredo Arrieta et al. (2020); Das and Rad (2020). A first approach in XAI is to focus on white-box models that are interpretable by design. However, restricting oneself to a class of inherently interpretable models often comes at the cost of lower prediction accuracy Rai (2019). In this work, we rather focus on post-hoc explainability techniques. These methods aim at improving the interpretability of black-box models by complementing their predictions with various kinds of explanations. In this way, it is possible to understand the prediction of a model without sacrificing its accuracy.

Feature importance explanations are undoubtedly the most widespread type of post-hoc explanations. Popular feature importance methods include SHAP Shapley (1953); Datta et al. (2016); Lundberg and Lee (2017), LIME Ribeiro et al. (2016), Integrated Gradients Sundararajan et al. (2017), Contrastive Examples Dhurandhar et al. (2018) and Masks Fong and Vedaldi (2017); Fong et al. (2019); Crabbé and van der Schaar (2021). These methods complement the model prediction for an input example with a score attributed to each input feature. This score reflects the importance of each feature for the model to issue its prediction. Knowing which features are important for a model prediction certainly provides more information about the model than the prediction alone. However, these methods do not provide a reason as to why the model pays attention to these particular features.

Another approach is to contextualize each model prediction with the help of relevant examples. In fact, recent works Nguyen et al. (2021) have demonstrated that human subjects often find example-based explanations more insightful than feature importance explanations. Complementing the model’s predictions with relevant examples previously seen by the model is commonly known as Case-Based Reasoning (CBR) Caruana et al. (1999); Bichindaritz and Marling (2006); Keane and Kenny (2019a). The implementations of CBR generally involve models that create a synthetic representation of the dataset, where examples with similar patterns are summarized by prototypes Kim et al. (2015, 2016); Gurumoorthy et al. (2017). At inference time, these models relate new examples to one or several prototypes to issue a prediction. In this way, the patterns that are used by the model to issue a prediction are made explicit with the help of relevant prototypes. A limitation of this approach is the restricted model architecture. The aforementioned procedure requires opting for a family of models that rely on prototypes to issue a prediction. This family of models might not always be the most suitable for the task at hand. This motivates the development of generic post-hoc methods that make few or no assumptions on the model.

The most common approach to provide example-based explanations for a wide variety of models mirrors feature importance methods. The idea is to complement the model prediction by attributing a score to each training example. This score reflects the importance of each training example for the model to issue its prediction. It is typically computed by simulating the effect of removing each training instance from the training set on the learned model Cook and Weisberg (1982). Popular examples of such methods include Influence Functions Koh and Liang (2017) and Data-Shapley Ghorbani and Zou (2019); Ghorbani et al. (2020). These methods offer the advantage of being flexible enough to be used with a wide variety of models. They produce scores that describe what the model could have predicted if some examples were absent from the training set. This is very interesting from a data valuation perspective. From an explanation perspective, however, it is not clear how to reconstruct the model predictions from these importance scores.

So far, we have only discussed works that provide explanations of a model output, which is the tip of the iceberg. Modern machine learning models involve many convoluted transformations to deduce the output from an input. These transformations are expressed in terms of intermediate variables that are often called latent variables. Some treatment of these latent variables is necessary if we want to provide explanations that take the model complexity into account. This motivates several works that push the explainability task beyond the realm of model outputs. Among the most noticeable contributions in this endeavour, we cite Concept Activation Vectors, which create a dictionary between human-friendly concepts (such as the presence of stripes in an image) and their representation in terms of latent vectors Kim et al. (2017). Another interesting contribution is the Deep k-Nearest Neighbors model that contextualizes the prediction for an example with its nearest neighbours in the space of latent variables, the latent space Papernot and McDaniel (2018). An alternative exploration of the latent space is offered by the representer theorem that allows, under restrictive assumptions, to use latent vectors to decompose a model’s prediction in terms of its training examples Yeh et al. (2018).

Figure 1: An example of corpus decomposition with SimplEx.

Contribution In this work, we introduce a novel approach called SimplEx that lies at the crossroad of the above research directions. SimplEx outputs post-hoc explanations in the form of Figure 1, where the model’s prediction and latent representation for a test example are approximated as a mixture of examples extracted from a corpus of examples. In each case, SimplEx highlights the role played by each feature of each corpus example in the latent space decomposition. SimplEx centralizes many functionalities that, to the best of our knowledge, constitute a leap forward from the previous state of the art. (1) SimplEx gives the user freedom to choose the corpus of examples with which the model’s predictions are decomposed. Unlike previous methods such as the representer theorem, there is no need for this corpus of examples to be equal to the model’s training set. This is particularly interesting for two reasons: (a) the training set of a model is not always accessible; (b) the user might want explanations in terms of examples that make sense for them. For instance, a doctor might want to understand the predictions of a risk model in terms of patients they know. (2) The decompositions of SimplEx are valid both in latent and output space. We show that, in both cases, the corpus mixtures discovered by SimplEx offer significantly more precision and robustness than previous methods such as Deep k-Nearest Neighbors and the representer theorem. (3) SimplEx details the role played by each feature in the corpus mixture. This is done by introducing Integrated Jacobians, a generalization of Integrated Gradients that makes the contribution of each corpus feature explicit in the latent space decomposition. This creates a bridge between two research directions that have mostly developed independently: feature importance and example-based explanations Keane and Kenny (2019a, b). In Section C of the supplementary material, we report a user study involving 10 clinicians that supports the significance of our contribution.

2 SimplEx

In this section, we formulate our method rigorously. Our purpose is to explain the black-box prediction for an unseen test example with the help of a set of known examples that we call the corpus. We start with a clear statement of the family of black-boxes for which our method applies. Then, we detail how the set of corpus examples can be used to decompose a black-box representation for the unseen example. Finally, we show that the corpus decomposition can offer explanations at the feature level.

2.1 Preliminaries

Let $\mathcal{X} \subseteq \mathbb{R}^{d_X}$ be an input (or feature) space and $\mathcal{Y} \subseteq \mathbb{R}^{d_Y}$ be an output (or label) space, where $d_X$ and $d_Y$ are respectively the dimension of the input and the output space. Our task is to explain individual predictions of a given black-box $f : \mathcal{X} \rightarrow \mathcal{Y}$. In order to build our explainability method, we need to make an assumption on the family of black-boxes that we wish to interpret.

Assumption 2.1 (Black-box Restriction).

We restrict to black-boxes that can be decomposed as $f = l \circ g$, where $g : \mathcal{X} \rightarrow \mathcal{H}$ maps an input $x \in \mathcal{X}$ to a latent vector $h = g(x)$ and $l : \mathcal{H} \rightarrow \mathcal{Y}$ linearly maps¹ a latent vector to an output $y = l(h) \in \mathcal{Y}$. In the following, we call $\mathcal{H} \subseteq \mathbb{R}^{d_H}$ the latent space. Typically, this space has higher dimension than the output space: $d_H > d_Y$.

¹ The map $l$ can in fact be affine. In the following, we omit the bias term, which can be reabsorbed in $g$.

Remark 2.1.

In the context of deep learning, this assumption requires that the last hidden layer maps linearly to the output. While this is often the case, the assumption is crucial in the following, since we will use the fact that linear combinations in latent space correspond to linear combinations in output space. Our purpose is to gain insights into the structure of the latent space.

Remark 2.2.

This assumption is compatible with regression and classification models; we just need to clarify what we mean by output in the case of classification. If $f$ is a classification black-box that predicts the probabilities for each class, it will typically take the form in Assumption 2.1 up to a normalizing map $\sigma$ (typically a softmax): $f = \sigma \circ l \circ g$. In this case, we ignore² the normalizing map and define the output to be $y = (l \circ g)(x)$.

² There is no loss of information, as the output $y$ allows us to reconstruct the class probabilities $p$ via $p = \sigma(y)$.

Our explanations for $f$ rely on a set of examples that we call the corpus. These examples will typically (but not necessarily) be a representative subset of the black-box training set. The corpus has to be understood as a set of reference examples that we want to use as building blocks to interpret unseen examples. In order to index these examples, it will be useful to denote by $[n_1 : n_2]$ the set of natural numbers between the natural numbers $n_1$ and $n_2$ with $n_1 < n_2$. Further, we denote by $[n] \equiv [1 : n]$ the set of natural numbers between $1$ and $n$. The corpus is a set $\mathcal{C} = \{ x^c \in \mathcal{X} \mid c \in [C] \}$ containing $C$ examples. In the following, superscripts are labels for examples and subscripts are labels for vector components. In this way, $x^c_i$ has to be understood as the component $i \in [d_X]$ of corpus example $c \in [C]$.

2.2 A corpus of examples to explain a latent representation

Our purpose is to understand the prediction $f(x)$ for an unseen test example $x \in \mathcal{X}$ with the help of the corpus. How can we decompose this prediction in terms of the corpus predictions $\{ f(x^c) \mid c \in [C] \}$? A naive attempt would be to express $x$ as a mixture of inputs from the corpus, $x \approx \sum_{c=1}^C w_c \, x^c$, with weights that sum to one: $\sum_{c=1}^C w_c = 1$. The weakness of this approach is that the signification of the mixture weights is not conserved if the black-box $f$ is not a linear map: $f\left( \sum_{c=1}^C w_c \, x^c \right) \neq \sum_{c=1}^C w_c \, f(x^c)$.

Fortunately, Assumption 2.1 offers us a better vector space in which to perform a corpus decomposition of the unseen example $x$. We first note that the map $g$ induces a latent representation of the corpus: $\{ h^c = g(x^c) \mid c \in [C] \}$. Similarly, $x$ has a latent representation $h = g(x)$. Following the above line of reasoning, we could therefore perform a corpus decomposition in latent space: $h \approx \sum_{c=1}^C w_c \, h^c$. Now, by using the linearity of $l$, we can compute the black-box output of this mixture in latent space: $l\left( \sum_{c=1}^C w_c \, h^c \right) = \sum_{c=1}^C w_c \, l(h^c)$. In this case, the weights that are used to decompose the latent representation $h$ in terms of the latent representations of the corpus also reflect the way in which the black-box prediction can be decomposed in terms of the corpus outputs $y^c = l(h^c)$. This hints that the latent space is endowed with the appropriate geometry to make corpus decompositions. More formally, we think in terms of the convex hull spanned by the corpus.
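This argument can be checked numerically with a toy model (the encoder, linear layer and corpus values below are illustrative, not from the paper): a mixture of latent vectors maps to the same mixture of outputs because the last layer is linear, whereas a mixture taken directly in input space does not.

```python
import numpy as np

rng = np.random.default_rng(0)
g = lambda X: np.tanh(X)                 # toy nonlinear feature map g
L = rng.normal(size=(3, 2))              # linear last layer l (d_Y = 3, d_H = 2)
f = lambda X: g(X) @ L.T                 # black-box f = l ∘ g

x_corpus = rng.normal(size=(4, 2))       # toy corpus inputs
w = np.array([0.1, 0.2, 0.3, 0.4])       # convex mixture weights (sum to 1)

# A mixture in latent space maps to the SAME mixture in output space,
# because l is linear...
h_mix = w @ g(x_corpus)
assert np.allclose(h_mix @ L.T, w @ f(x_corpus))

# ...but a mixture in input space does not, because f is nonlinear.
assert not np.allclose(f(w @ x_corpus), w @ f(x_corpus))
```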

Definition 2.1 (Corpus Hull).

The corpus convex hull spanned by a corpus $\mathcal{C}$ with latent representation $\{ h^c \mid c \in [C] \}$ is the convex set
$$\mathcal{CH} = \left\{ \sum_{c=1}^C w_c \, h^c \;\middle|\; w \in [0,1]^C , \; \sum_{c=1}^C w_c = 1 \right\}.$$

Remark 2.3.

This is the set of latent vectors that are a mixture of the corpus latent vectors.

At this stage, it is important to notice that an exact corpus decomposition is not possible if $h \notin \mathcal{CH}$. In such a case, the best we can do is to find the element $h^{\mathcal{C}} \in \mathcal{CH}$ that best approximates $h$. If $\mathcal{H}$ is endowed with a norm $\| \cdot \|_{\mathcal{H}}$, this corresponds to the convex optimization problem
$$h^{\mathcal{C}} = \arg\min_{\tilde{h} \in \mathcal{CH}} \left\| h - \tilde{h} \right\|_{\mathcal{H}} . \tag{1}$$
By definition, the corpus representation of $h$ can be expanded³ as a mixture of elements from $\{ h^c \mid c \in [C] \}$: $h^{\mathcal{C}} = \sum_{c=1}^C w_c \, h^c$. The weight $w_c$ can naturally be interpreted as a measure of the importance of corpus example $x^c$ in the reconstruction of $h$ with the corpus. Clearly, $w_c \approx 0$ for some $c \in [C]$ indicates that $x^c$ does not play a significant role in the corpus representation of $h$. On the other hand, $w_c \approx 1$ indicates that $h^c$ generates the corpus representation by itself.

³ Note that this decomposition might not be unique; more details in Section A of the supplementary material.
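Problem (1) is a convex program over the probability simplex. One simple way to solve it, in the spirit of the authors' public implementation, is gradient descent on softmax logits, so that the simplex constraints (non-negative weights summing to one) hold by construction. A minimal sketch with the Euclidean norm (step size and iteration count are illustrative choices, not the paper's hyper-parameters):

```python
import numpy as np

def simplex_weights(h, H_corpus, n_steps=3000, lr=1.0):
    """Fit convex mixture weights w so that w @ H_corpus ≈ h (problem (1)).

    Parametrizing w = softmax(logits) keeps the weights non-negative
    and summing to one throughout the optimization.
    """
    C = H_corpus.shape[0]
    logits = np.zeros(C)
    for _ in range(n_steps):
        w = np.exp(logits - logits.max())
        w /= w.sum()
        # gradient of L = ||w @ H_corpus - h||^2 / 2 with respect to w
        grad_w = H_corpus @ (w @ H_corpus - h)
        # chain rule through the softmax: dw_c/dlogit_j = w_c (δ_cj - w_j)
        logits -= lr * w * (grad_w - w @ grad_w)
    w = np.exp(logits - logits.max())
    return w / w.sum()
```

Since the objective is convex in $w$, this simple scheme recovers a near-optimal mixture whenever the latent vector lies inside (or close to) the corpus hull.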

At this stage, a natural question arises: how can we know if the corpus approximation $h^{\mathcal{C}}$ is a good approximation for $h$? The answer is given by the residual vector $h - h^{\mathcal{C}}$ that measures the shift between the latent representation $h$ and the corpus hull $\mathcal{CH}$. It is natural to use this residual vector to detect examples that cannot be explained with the selected corpus of examples $\mathcal{C}$.

Definition 2.2 (Corpus Residual).

The corpus residual associated to a latent vector $h \in \mathcal{H}$ and its corpus representation $h^{\mathcal{C}}$ solving (1) is the quantity
$$r_{\mathcal{C}}(h) = \left\| h - h^{\mathcal{C}} \right\|_{\mathcal{H}} .$$

Figure 2: Corpus convex hull and residual.

In Section A.1 of the supplementary material, we show that the corpus residual also controls the quality of the corpus approximation in output space $\mathcal{Y}$. All the corpus-related quantities that we have introduced so far are summarized visually in Figure 2. Note that this figure is a simplification of the reality, as the latent space dimension $d_H$ will typically be larger than 3 and the corpus size $C$ will typically be higher than 2. We are now endowed with a rigorous way to decompose a test example in terms of corpus examples in latent space. In the next section, we detail how to pull this decomposition back to input space.

2.3 Transferring the corpus explanation in input space

Now that we are endowed with a corpus decomposition $h^{\mathcal{C}}$ that approximates $h$, it would be convenient to have an understanding of the corpus decomposition in input space $\mathcal{X}$. For the sake of notation, we will assume that the corpus approximation is good, so that it is unnecessary to draw a distinction between the latent representation $h$ of the unseen example $x$ and its corpus decomposition $h^{\mathcal{C}}$. If we want to understand the corpus decomposition in input space, a natural approach Sundararajan et al. (2017) is to fix a baseline input $\bar{x} \in \mathcal{X}$ together with its latent representation $\bar{h} = g(\bar{x})$. Let us now decompose the representation shift $h - \bar{h}$ in terms of the corpus:
$$h - \bar{h} = \sum_{c=1}^C w_c \left( h^c - \bar{h} \right) . \tag{2}$$
With this decomposition, we understand the total shift in latent space in terms of individual contributions from each corpus member. In the following, we focus on the comparison between the baseline $\bar{x}$ and a single corpus example $x^c$ together with its latent representation $h^c$, keeping in mind that the full decomposition (2) can be reconstructed with the whole corpus. To bring the discussion in input space $\mathcal{X}$, we interpret the shift $h^c - \bar{h}$ in latent space as resulting from the shift $x^c - \bar{x}$ in input space. We are interested in the contribution of each feature to the latent space shift. To decompose the shift in latent space in terms of the features, we parametrize the shift in input space with a line that goes from the baseline to the corpus example: $\gamma^c(t) = \bar{x} + t \left( x^c - \bar{x} \right)$ for $t \in [0,1]$. Together with the black-box, this line induces a curve $g \circ \gamma^c$ in latent space that goes from the baseline latent representation $\bar{h}$ to the corpus example latent representation $h^c$. Let us now use an infinitesimal decomposition of this curve to make the contribution of each input feature explicit. If we assume that $g$ is differentiable at $\gamma^c(t)$, we can use a first order approximation of the curve at the vicinity of $t$ to decompose the infinitesimal shift in latent space:
$$dh = \sum_{i=1}^{d_X} \frac{\partial g}{\partial x_i}\left[ \gamma^c(t) \right] d\gamma^c_i(t) = \sum_{i=1}^{d_X} \left( x^c_i - \bar{x}_i \right) \frac{\partial g}{\partial x_i}\left[ \gamma^c(t) \right] dt ,$$
where we used $d\gamma^c_i(t) = \left( x^c_i - \bar{x}_i \right) dt$ to obtain the second equality. In this decomposition, each input feature contributes additively to the infinitesimal shift in latent space. It follows trivially that the contribution of the input feature corresponding to input dimension $i \in [d_X]$ is given by
$$dh_i = \left( x^c_i - \bar{x}_i \right) \frac{\partial g}{\partial x_i}\left[ \gamma^c(t) \right] dt .$$
In order to compute the overall contribution of feature $i$ to the shift, we let $t$ vary over $[0,1]$ and we sum the infinitesimal contributions along the line $\gamma^c$. If we assume⁴ that $g$ is almost everywhere differentiable, this sum converges to an integral in the limit $dt \to 0$. This motivates the following definitions.

⁴ This is not restrictive; DNNs with ReLU activation functions satisfy this assumption, for instance.

Definition 2.3 (Integrated Jacobian & Projection).

The integrated Jacobian between a baseline $\bar{x}$ and a corpus example $x^c$ associated to feature $i \in [d_X]$ is
$$\mathcal{J}^c_i = \left( x^c_i - \bar{x}_i \right) \int_0^1 \frac{\partial g}{\partial x_i}\left[ \gamma^c(t) \right] dt ,$$
where $\gamma^c(t) = \bar{x} + t \left( x^c - \bar{x} \right)$ for $t \in [0,1]$. This vector indicates the shift in latent space induced by feature $i$ of corpus example $x^c$ when comparing the corpus example with the baseline. To summarize this contribution to the shift described in (2), we define the projected Jacobian
$$P^c_i = \frac{\left\langle \mathcal{J}^c_i , h - \bar{h} \right\rangle_{\mathcal{H}}}{\left\| h - \bar{h} \right\|^2_{\mathcal{H}}} ,$$
where $\langle \cdot , \cdot \rangle_{\mathcal{H}}$ is an inner product for $\mathcal{H}$ and the normalization is chosen for the purpose of Proposition 2.1.

Remark 2.4.

The integrated Jacobian can be seen as a latent-space generalization of Integrated Gradients Sundararajan et al. (2017). In Section A.3 of the supplementary material, we establish the relationship between the two quantities: applying the linear map $l$ to the integrated Jacobian yields the Integrated Gradients of $f$, i.e. $\mathrm{IG}_i(x^c) = l\left( \mathcal{J}^c_i \right)$.

Figure 3: Integrated Jacobian and projection.

We summarize the Jacobian quantities in Figure 3. By inspecting the figure, we notice that projected Jacobians encode the contribution of feature $i$ from corpus example $x^c$ to the overall shift in latent space: $P^c_i > 0$ implies that this feature creates a shift pointing in the same direction as the overall shift; $P^c_i < 0$ implies that this feature creates a shift pointing in the opposite direction; and $P^c_i = 0$ implies that this feature creates a shift in an orthogonal direction. We use the projections to summarize the contribution of each feature in Figures 1, 8 & 9. The colors blue and red indicate respectively a positive and a negative projection. In addition to these geometrical insights, Jacobian quantities come with natural properties.

Proposition 2.1 (Properties of Integrated Jacobians).

Consider a baseline $\bar{x}$ and a test example $x$ together with their latent representations $\bar{h}$ and $h$. If the shift $h - \bar{h}$ admits a decomposition (2), the following properties hold:
$$h - \bar{h} = \sum_{c=1}^C \sum_{i=1}^{d_X} w_c \, \mathcal{J}^c_i , \qquad \sum_{c=1}^C \sum_{i=1}^{d_X} w_c \, P^c_i = 1 .$$
The proof is provided in Section A.4 of the supplementary material. ∎

These properties show that the integrated Jacobians and their projections are the quantities that we are looking for: they transfer the corpus explanation into input space. The first equality decomposes the shift in latent space in terms of contributions arising from each feature of each corpus example. The second equality sets a natural scale for the contribution of each feature. For this reason, it is natural to use $w_c \, P^c_i$ to measure the contribution of feature $i$ of corpus example $c$.
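The integral in Definition 2.3 can be estimated with a Riemann sum along the line $\gamma^c$, using finite differences for $\partial g / \partial x_i$. A sketch (the midpoint discretization and step sizes are my own choices, not the paper's implementation):

```python
import numpy as np

def integrated_jacobians(g, x_corpus, x_baseline, n_bins=50, eps=1e-5):
    """Riemann-sum estimate of the integrated Jacobians J^c_i.

    g        : maps a batch (n, d_X) of inputs to latents (n, d_H)
    returns  : array of shape (C, d_X, d_H) holding, for every corpus
               example c and input feature i, the latent shift vector.
    """
    shift = x_corpus - x_baseline                      # x^c - x̄, shape (C, d_X)
    C, d_X = x_corpus.shape
    d_H = g(x_corpus).shape[1]
    J = np.zeros((C, d_X, d_H))
    # midpoint rule along gamma^c(t) = x̄ + t (x^c - x̄)
    for t in (np.arange(n_bins) + 0.5) / n_bins:
        pts = x_baseline + t * shift
        for i in range(d_X):
            # finite-difference partial derivative ∂g/∂x_i at gamma^c(t)
            plus, minus = pts.copy(), pts.copy()
            plus[:, i] += eps
            minus[:, i] -= eps
            dg_dxi = (g(plus) - g(minus)) / (2 * eps)  # shape (C, d_H)
            J[:, i, :] += shift[:, i, None] * dg_dxi / n_bins
    return J
```

The first equality of Proposition 2.1 provides a convenient sanity check: for each corpus example, summing $\mathcal{J}^c_i$ over the features $i$ should recover the latent shift $g(x^c) - g(\bar{x})$.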

3 Experiments

In this section, we evaluate several aspects of our method quantitatively. In a first experiment, we verify that the corpus decomposition scheme described in Section 2 yields good approximations for the latent representations of test examples extracted from the same dataset as the corpus examples. In a realistic clinical use case, we illustrate the usage of SimplEx in a set-up where different corpora reflecting different datasets are used. The experiments are summarized below. In Section B of the supplementary material, we provide more details and further experiments with time series and synthetic data. The code for our method and experiments is available on the GitHub repository https://github.com/JonathanCrabbe/Simplex. All the experiments have been replicated on different machines.

3.1 Precision of corpus decomposition

Figure 4: Precision of corpus decomposition for prostate cancer (avg ± std); panels: (a) $R^2$ score for the latent approximation, (b) $R^2$ score for the output approximation.
Figure 5: Precision of corpus decomposition for MNIST (avg ± std); same panels as Figure 4.
Description The purpose of this experiment is to check whether the corpus decompositions described in Section 2 allow us to build good approximations of the latent representations of test examples. We start with a dataset that we split into a training set and a testing set. We train a black-box $f$ for a given task on the training set. We randomly sample a set of corpus examples from the training set (we omit the true labels for the corpus examples) and a set of test examples from the testing set. For each test example $x$, we build an approximation for $h = g(x)$ with the corpus examples' latent representations. In each case, we let the method use only $K$ corpus examples to build the approximation. We repeat the experiment for several values of $K$.

Metrics We are interested in measuring the precision of the corpus approximation in latent space and in output space. To that aim, we use the $R^2$ score in both spaces. In this way, $R^2_{\mathcal{H}}$ measures the precision of the corpus approximation with respect to the true latent representation $h$. Similarly, $R^2_{\mathcal{Y}}$ measures the precision of the corpus approximation with respect to the true output $y$. Both of these metrics satisfy $R^2 \leq 1$. A higher score is better, with $R^2 = 1$ corresponding to a perfect approximation. All the metrics are computed over the test examples. The experiments are repeated 10 times to report standard deviations across different runs.
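For concreteness, the $R^2$ score pooled over all test examples and dimensions can be computed as follows (a standard definition; pooling jointly over dimensions is one possible convention):

```python
import numpy as np

def r2_score(true, pred):
    """Pooled R^2: 1 minus the residual sum of squares over the total
    sum of squares, computed jointly over all examples and dimensions."""
    ss_res = np.sum((true - pred) ** 2)
    ss_tot = np.sum((true - true.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot
```

A perfect approximation yields $R^2 = 1$, while predicting the mean of the true values yields $R^2 = 0$; worse-than-mean approximations give negative scores.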

Baselines We compare our method⁵ (SimplEx) with 3 baselines. A first approach, inspired by Papernot and McDaniel (2018), consists in using the $K$-nearest corpus neighbours in latent space to build the latent approximation. Building on this idea, we introduce two baselines: (1) KNN Uniform, which takes the average latent representation of the $K$-nearest corpus neighbours of $h$ in latent space; (2) KNN Distance, which computes the same average with weights inversely proportional to the distance $\| h - h^c \|_{\mathcal{H}}$. Finally, we use the representer theorem Yeh et al. (2018) to produce an approximation of $y$ with the corpus. Unlike the other methods, the representer theorem does not allow to produce an approximation in latent space.

⁵ To enforce SimplEx to select $K$ examples, we add a penalty making the smallest weights vanish.
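The two KNN baselines can be sketched as follows (a minimal version; the distance metric and tie-breaking are my own choices):

```python
import numpy as np

def knn_latent_approx(h, H_corpus, k, weighting="uniform"):
    """Baseline latent approximations in the spirit of Deep k-NN:
    average the k nearest corpus latents, either uniformly or with
    weights inversely proportional to the distance to h."""
    dist = np.linalg.norm(H_corpus - h, axis=1)
    idx = np.argsort(dist)[:k]
    if weighting == "uniform":
        w = np.full(k, 1.0 / k)
    else:  # "distance"
        inv = 1.0 / (dist[idx] + 1e-12)
        w = inv / inv.sum()
    return w @ H_corpus[idx]
```

Unlike SimplEx, the mixture weights here are fixed by the choice of $K$ and the distances, rather than optimized to reconstruct $h$.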

Datasets We use two different datasets with distinct tasks for our experiment: (1) 240,486 patients enrolled in the American SEER program National Cancer Institute (2019). We consider the binary classification task of predicting cancer mortality for patients with prostate cancer. We train a multilayer perceptron (MLP) for this task. Since this task is simple, we show that a small corpus of patients yields good approximations. (2) 70,000 MNIST images of handwritten digits Deng (2012). We consider the multiclass classification task of identifying the digit represented on each image. We train a convolutional neural network (CNN) for the image classification. Since this classification task is more complex than the previous one (higher $d_X$ and $d_Y$), we show that a larger corpus of images is required to yield good approximations in this case.

Results The results for SimplEx and the KNN baselines are presented in Figures 4 & 5. Several things can be deduced from these results. (1) It is generally harder to produce a good approximation in latent space than in output space, as $R^2_{\mathcal{H}} \leq R^2_{\mathcal{Y}}$ for most examples. (2) SimplEx produces the most accurate approximations, both in latent and output space, and these approximations are of high quality. (3) The trends are qualitatively different between SimplEx and the other baselines. The accuracy of SimplEx increases with $K$ and stabilizes once a small number of corpus members contribute. The accuracy of the KNN baselines increases with $K$, reaches a maximum for a small $K$ and steadily decreases for larger $K$. This can be understood easily: when $K$ increases beyond the number of relevant corpus examples, irrelevant examples are added to the decomposition. SimplEx will typically annihilate the effect of these irrelevant examples by setting their weights to zero in the corpus decomposition. The KNN baselines include the irrelevant corpus members in the decomposition, which degrades the quality of the approximation. This suggests that $K$ has to be tuned for each example with the KNN baselines, while the optimal number of contributing corpus examples is learned by SimplEx. (4) The standard deviations indicate that the performance of SimplEx is more consistent across different runs. This is particularly true in the prostate cancer experiment, where the corpus size is smaller. This suggests that SimplEx is more robust than the baselines. (5) For the representer theorem, $R^2_{\mathcal{Y}}$ is low for both the prostate cancer dataset and MNIST, which corresponds to poor estimations of the black-box output. We propose some hypotheses to explain this observation in Section B.1 of the supplementary material.

3.2 Significance of Jacobian Projections

Description The purpose of this experiment is to check whether SimplEx's Jacobian Projections are a good measure of the importance of each corpus feature in constructing the test latent representation $h$. In the same setting as in the previous experiment, we start with a corpus of MNIST images. We build a corpus approximation $h^{\mathcal{C}}$ for a test example with latent representation $h$. The precision of this approximation is reflected by its corpus residual $r_{\mathcal{C}}$. For each corpus example $x^c$, we would like to identify the features that are the most important in constructing the corpus decomposition of $h$. With SimplEx, this is reflected by the Jacobian Projections $P^c_i$. We evaluate these scores for each feature $i$ of each corpus example $x^c$. For each corpus image, we select the most important pixels according to the Jacobian Projections and the baseline. In each case, we build a mask $m^c$ that replaces these most important pixels by black pixels. This yields a corrupted corpus image $m^c \odot x^c$, where $\odot$ denotes the Hadamard product. By corrupting all the corpus images, we obtain a corrupted corpus. We analyse how well this corrupted corpus approximates $h$; this yields a residual $\tilde{r}_{\mathcal{C}}$.
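The corruption step can be sketched as follows, for images flattened to feature vectors and with black assumed to be encoded as zero (both are my own conventions here):

```python
import numpy as np

def corrupt_corpus(x_corpus, saliency, n_pixels):
    """Blacken, in every corpus image, the n_pixels features with the
    highest saliency score (e.g. the Jacobian projections P^c_i)."""
    mask = np.ones_like(x_corpus)
    for c in range(x_corpus.shape[0]):
        top = np.argsort(saliency[c])[::-1][:n_pixels]
        mask[c, top] = 0.0
    return mask * x_corpus  # Hadamard product m^c ⊙ x^c
```

The corrupted corpus is then fed back through the encoder and the corpus decomposition is recomputed to obtain the residual $\tilde{r}_{\mathcal{C}}$.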

Metric We are interested in measuring the effectiveness of the corpus corruption. This is reflected by the residual shift $\tilde{r}_{\mathcal{C}} - r_{\mathcal{C}}$. A higher value for this metric indicates that the features selected by the saliency method are more important for the corpus to produce a good approximation of $h$ in latent space. We repeat this experiment for 100 test examples and for different numbers of perturbed pixels.

Baseline As a baseline for our experiment, we use Integrated Gradients, which is close in spirit to our method. In a similar fashion, we compute the Integrated Gradients for each feature of each corpus example and construct a corrupted corpus based on these scores.

Results The results are presented in the form of box plots in Figure 6. We observe that the corruptions induced by the Jacobian Projections are significantly more impactful when few pixels are perturbed. The two methods become equivalent when more pixels are perturbed. This demonstrates that Jacobian Projections are more suitable for measuring the importance of features in a latent space reconstruction, as is the case for SimplEx.

Figure 6: Increase in the corpus residual caused by each method (higher is better).

3.3 Use case: clinical risk model across countries

Very often, clinical risk models are produced and validated with the data of patients treated at a single site Wu et al. (2021). This can cause problems when these models are deployed at different sites for two reasons: (1) patients from different sites can have different characteristics; (2) rules that are learned for one site might not hold for another site. One possible way to alleviate this problem is to detect patients for which the model prediction is highly extrapolated and/or ambiguous. In this way, doctors from different sites can make informed use of the risk model rather than blindly believing its predictions. We demonstrate that SimplEx provides a natural framework for this set-up.

As in the previous experiment, we consider a dataset $\mathcal{D}_{US}$ containing patients enrolled in the American SEER program National Cancer Institute (2019), on which we train and validate an MLP risk model. To give a realistic realization of the above use case, we assume that we want to deploy this risk model in a different site: the United Kingdom. For this purpose, we extract a set $\mathcal{D}_{UK}$ of 10,086 patients enrolled in the British Prostate Cancer UK program UK (2019). These patients are characterized by the same features in both datasets. However, the datasets differ by a covariate shift: patients from $\mathcal{D}_{UK}$ are in general older and at earlier clinical stages.

When comparing the two populations through the model, a first interesting question to ask is whether the covariate shift between $\mathcal{D}_{US}$ and $\mathcal{D}_{UK}$ affects the model representation. To explore this question, we take a first corpus of American patients $\mathcal{C}_{US} \subset \mathcal{D}_{US}$. If there is indeed a difference in terms of the latent representations, we expect the representations of test examples from $\mathcal{D}_{UK}$ to be less closely approximated by their decomposition with respect to $\mathcal{C}_{US}$. If this is true, the corpus residuals associated to examples of $\mathcal{D}_{UK}$ will typically be larger than the ones associated to examples of $\mathcal{D}_{US}$. To evaluate this quantitatively, we consider a mixed set of test examples, with 100 examples sampled from each of $\mathcal{D}_{US}$ and $\mathcal{D}_{UK}$. We then approximate the latent representation $h$ of each test example and compute the associated corpus residual $r_{\mathcal{C}}(h)$. We sort the test examples by decreasing order of corpus residual and we use this sorted list to see if we can detect the examples from $\mathcal{D}_{UK}$. We use the previous baselines for comparison; results are shown in Figure 7.
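The detection procedure (sort by decreasing residual, then count how many top-ranked examples come from the shifted dataset) can be sketched as:

```python
import numpy as np

def detection_curve(residuals, is_shifted):
    """Sort test examples by decreasing corpus residual; entry n-1 of
    the returned array is the number of shifted examples among the n
    examples with the largest residuals."""
    order = np.argsort(residuals)[::-1]
    return np.cumsum(np.asarray(is_shifted)[order])
```

If the residuals separate the two populations well, the curve rises steeply at the start: the examples with the largest residuals are predominantly the shifted ones.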

Figure 7: Detecting UK patients (avg. ± std.).

Several things can be deduced from this experiment. (1) The results strongly suggest that the difference between the American and British datasets is reflected in their latent representations. (2) The corpus residuals from SimplEx offer the most reliable way to detect examples that differ from the corpus examples. None of the methods matches the maximal baseline, since some British examples resemble American ones. (3) When the corpus examples are representative of the training set, as is the case in this experiment, our approach based on SimplEx provides a systematic way to detect test examples whose representations differ from those produced at training time. A doctor should be more sceptical of model predictions associated with larger corpus residuals, as these arise from an extrapolation region of the latent space.
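The sorting-based detection above can be sketched in a few lines (a hypothetical helper, not the authors' code): given precomputed corpus residuals for the mixed test set and a flag marking which examples come from the shifted (British) population, we sort by decreasing residual and count how many shifted examples appear in each prefix of the list:

```python
def detection_curve(residuals, is_shifted):
    """Sort test examples by decreasing corpus residual and count, for each
    prefix of the sorted list, how many covariate-shifted examples it catches."""
    order = sorted(range(len(residuals)), key=lambda i: -residuals[i])
    caught, curve = 0, []
    for i in order:
        caught += is_shifted[i]   # 1 if the example comes from the shifted set
        curve.append(caught)
    return curve

# The two shifted examples carry the largest residuals, so they are caught
# by the first two flags.
curve = detection_curve([0.9, 0.1, 0.5, 0.2], [1, 0, 1, 0])
```

A steeply rising curve, as in Figure 7, indicates that corpus residuals separate the shifted examples well.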

Let us now make the case more concrete. Suppose that an American and a British doctor use the above risk model to predict the outcome for their patients. Each doctor wants to decompose the predictions of the model in terms of patients they know. Hence, the American doctor selects a corpus of American patients and the British doctor selects a corpus of British patients; both corpora have the same size. We suppose that the doctors know the model prediction and the true outcome for each patient in their corpus. Both doctors are sceptical about the risk model and want to use SimplEx to decide when it can be trusted. This leads them to a natural question: is it possible to anticipate misclassifications with the help of SimplEx?

In Figures 8 & 9, we provide two typical examples of misclassified British patients together with their decompositions in terms of the two corpora. These two examples exhibit qualitatively different situations. In Figure 8, both the American and the British doctors make the same observation: the model relates the test patient to corpus patients that are mostly misclassified by the model. With the help of SimplEx, both doctors will rightfully be sceptical of the model’s prediction.

In Figure 9, something even more interesting occurs: the two corpus decompositions suggest different conclusions. From the American doctor’s perspective, the prediction for this patient appears perfectly coherent, as all patients in the corpus decomposition have very similar features and all of them are rightfully classified. The British doctor, on the other hand, reaches the opposite conclusion, as the most relevant corpus patient is misclassified by the model. In this case, we have a perfect illustration of the limitation of transferring a risk model from one site (America) to another (United Kingdom): similar patients from different sites can have different outcomes. Since the test patient is British, only the decomposition in terms of the British corpus really matters. In both cases, the British doctor could have anticipated the misclassification with SimplEx.

Figure 8: A first misclassified patient.
Figure 9: A second misclassified patient.

4 Discussion

We have introduced SimplEx, a method that decomposes the model representations at inference time in terms of a corpus. Through several experiments, we have demonstrated that these decompositions are accurate and can easily be personalized to the user. Finally, by introducing Integrated Jacobians, we have brought these explanations to the feature level.

We believe that our bridge between feature and example-based explainability opens up many avenues for the future. A first interesting extension would be to investigate how SimplEx can be used to understand latent representations involved in unsupervised learning. For instance, SimplEx could be used to study the interpretability of latent representations learned by autoencoders Ji et al. (2017). A second interesting possibility would be to design a rigorous scheme to select the optimal corpus for a given model and dataset. Finally, a formulation where we allow the corpus to vary on the basis of observations would be particularly interesting for online learning.


The authors are grateful to Alicia Curth, Krzysztof Kacprzyk, Boris van Breugel, Yao Zhang and the 4 anonymous NeurIPS 2021 reviewers for their useful comments on an earlier version of the manuscript. Jonathan Crabbé would like to thank Bogdan Cebere for replicating the experiments in this manuscript. Jonathan Crabbé would like to acknowledge Bilyana Tomova for many insightful discussions and her constant support. Jonathan Crabbé is funded by Aviva. Fergus Imrie is supported by the National Science Foundation (NSF), grant number 1722516. Zhaozhi Qian and Mihaela van der Schaar are supported by the Office of Naval Research (ONR), NSF 1722516.


  • [1] A. Abadie, A. Diamond, and J. Hainmueller (2010) Synthetic control methods for comparative case studies: Estimating the effect of California’s Tobacco control program. Journal of the American Statistical Association 105 (490), pp. 493–505. External Links: Document, ISSN 01621459, Link
  • [2] A. Abadie (2020) Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects. Journal of Economic Literature. External Links: Document, ISSN 0022-0515
  • [3] M. Amjad, D. Shah, and D. Shen (2018) Robust Synthetic Control. Journal of Machine Learning Research 19 (22), pp. 1–51. External Links: ISSN 1533-7928, Link
  • [4] R. Anirudh, J. J. Thiagarajan, R. Sridhar, and P. Bremer (2017) MARGIN: Uncovering Deep Neural Networks using Graph Signal Analysis. arXiv. External Links: 1711.05407, Link
  • [5] S. Athey, M. Bayati, N. Doudchenko, G. Imbens, and K. Khosravi (2017) Matrix Completion Methods for Causal Panel Data Models. arXiv. External Links: 1710.10251, Link
  • [6] A. Barredo Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. Garcia, S. Gil-Lopez, D. Molina, R. Benjamins, R. Chatila, and F. Herrera (2020) Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion 58, pp. 82–115. External Links: Document, ISSN 15662535, Link Cited by: §1.
  • [7] I. Bichindaritz and C. Marling (2006) Case-based reasoning in the health sciences: What’s next?. In Artificial Intelligence in Medicine, Vol. 36, pp. 127–135. External Links: Document, ISSN 09333657 Cited by: §1.
  • [8] R. Caruana, H. Kangarloo, J. D. Dionisio, U. Sinha, and D. Johnson (1999) Case-based explanation of non-case-based learning methods.. AMIA Symposium, pp. 212–215. External Links: ISSN 1531605X, Link Cited by: §1.
  • [9] T. Ching, D. S. Himmelstein, B. K. Beaulieu-Jones, A. A. Kalinin, B. T. Do, G. P. Way, E. Ferrero, P. Agapow, M. Zietz, M. M. Hoffman, W. Xie, G. L. Rosen, B. J. Lengerich, J. Israeli, J. Lanchantin, S. Woloszynek, A. E. Carpenter, A. Shrikumar, J. Xu, E. M. Cofer, C. A. Lavender, S. C. Turaga, A. M. Alexandari, Z. Lu, D. J. Harris, D. DeCaprio, Y. Qi, A. Kundaje, Y. Peng, L. K. Wiley, M. H. S. Segler, S. M. Boca, S. J. Swamidass, A. Huang, A. Gitter, and C. S. Greene (2018) Opportunities and obstacles for deep learning in biology and medicine. Journal of The Royal Society Interface 15 (141), pp. 20170387. External Links: Document, ISBN 0000000305396, ISSN 1742-5689, Link Cited by: §1.
  • [10] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik (2017) EMNIST: an extension of MNIST to handwritten letters. arXiv. External Links: 1702.05373, Link Cited by: §B.3.
  • [11] R. D. Cook and S. Weisenberg (1982) Residuals and influence in regression. New York: Chapman and Hall. Cited by: §1.
  • [12] J. Crabbé and M. van der Schaar (2021) Explaining Time Series Predictions with Dynamic Masks. International Conference on Machine Learning 38. External Links: 2106.05303, Link Cited by: §A.5, §1.
  • [13] A. Das and P. Rad (2020) Opportunities and Challenges in Explainable Artificial Intelligence (XAI): A Survey. arXiv. External Links: 2006.11371, Link Cited by: §1.
  • [14] A. Datta, S. Sen, and Y. Zick (2016) Algorithmic Transparency via Quantitative Input Influence: Theory and Experiments with Learning Systems. In Proceedings - 2016 IEEE Symposium on Security and Privacy, SP 2016, pp. 598–617. External Links: Document, ISBN 9781509008247 Cited by: §1.
  • [15] L. Deng (2012) The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine 29 (6), pp. 141–142. Cited by: §B.1, §3.1.
  • [16] A. Dhurandhar, P. Chen, R. Luss, C. Tu, P. Ting, K. Shanmugam, and P. Das (2018) Explanations based on the Missing: Towards Contrastive Explanations with Pertinent Negatives. Advances in Neural Information Processing Systems, pp. 592–603. External Links: 1802.07623, Link Cited by: §1.
  • [17] R. C. Fong and A. Vedaldi (2017) Interpretable Explanations of Black Boxes by Meaningful Perturbation. Proceedings of the IEEE International Conference on Computer Vision, pp. 3449–3457. External Links: Document, 1704.03296, ISBN 9781538610329, ISSN 15505499 Cited by: §1.
  • [18] R. Fong, M. Patrick, and A. Vedaldi (2019) Understanding deep networks via extremal perturbations and smooth masks. Proceedings of the IEEE International Conference on Computer Vision, pp. 2950–2958. External Links: Document, 1910.08485, ISBN 9781728148038, ISSN 15505499 Cited by: §A.5, §1.
  • [19] A. Ghorbani, M. P. Kim, and J. Zou (2020) A Distributional Framework for Data Valuation. arXiv. External Links: 2002.12334, Link Cited by: §1.
  • [20] A. Ghorbani and J. Zou (2019) Data Shapley: Equitable Valuation of Data for Machine Learning. 36th International Conference on Machine Learning, ICML 2019, pp. 4053–4065. External Links: 1904.02868, Link Cited by: §1.
  • [21] J. Gordetsky and J. Epstein (2016) Grading of prostatic adenocarcinoma: current state and prognostic implications. Diagnostic Pathology 11 (1). External Links: Document, Link Cited by: §B.2.
  • [22] K. S. Gurumoorthy, A. Dhurandhar, G. Cecchi, and C. Aggarwal (2017) Efficient Data Representation by Selecting Prototypes with Importance Weights. Proceedings - IEEE International Conference on Data Mining, ICDM, pp. 260–269. External Links: 1707.01212, Link Cited by: §1.
  • [23] P. Ji, T. Zhang, H. Li, M. Salzmann, and I. Reid (2017) Deep Subspace Clustering Networks. Advances in Neural Information Processing Systems, pp. 24–33. External Links: 1709.02508, Link Cited by: §4.
  • [24] M. T. Keane and E. M. Kenny (2019) How Case-Based Reasoning Explains Neural Networks: A Theoretical Analysis of XAI Using Post-Hoc Explanation-by-Example from a Survey of ANN-CBR Twin-Systems. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 11680 LNAI, pp. 155–171. External Links: Document, ISBN 9783030292485, ISSN 16113349, Link Cited by: §1, §1.
  • [25] M. T. Keane and E. M. Kenny (2019) The Twin-System Approach as One Generic Solution for XAI: An Overview of ANN-CBR Twins for Explaining Deep Learning. arXiv. External Links: 1905.08069, Link Cited by: §1.
  • [26] B. Kim, R. Khanna, and O. Koyejo (2016) Examples are not Enough, Learn to Criticize! Criticism for Interpretability. Advances in Neural Information Processing Systems 29. Cited by: §1.
  • [27] B. Kim, C. Rudin, and J. Shah (2015) The Bayesian Case Model: A Generative Approach for Case-Based Reasoning and Prototype Classification. Advances in Neural Information Processing Systems 3, pp. 1952–1960. External Links: 1503.01161, Link Cited by: §1.
  • [28] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, and R. Sayres (2017) Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). 35th International Conference on Machine Learning, ICML 2018 6, pp. 4186–4195. External Links: 1711.11279, Link Cited by: §1.
  • [29] P. W. Koh and P. Liang (2017) Understanding Black-box Predictions via Influence Functions. 34th International Conference on Machine Learning, ICML 2017 4, pp. 2976–2987. External Links: 1703.04730, Link Cited by: §B.5, §1.
  • [30] Z. C. Lipton (2016) The Mythos of Model Interpretability. Communications of the ACM 61 (10), pp. 35–43. External Links: 1606.03490, Link Cited by: §1.
  • [31] S. Lundberg and S. Lee (2017) A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems, pp. 4766–4775. External Links: 1705.07874, Link Cited by: §1.
  • [32] S. R. P. National Cancer Institute (2019) Surveillance, epidemiology, and end results (seer) program. Note: www.seer.cancer.gov Cited by: §B.1, §3.1, §3.3.
  • [33] G. Nguyen, D. Kim, and A. Nguyen (2021) The effectiveness of feature attribution methods and its correlation with automatic evaluation scores. Advances in Neural Information Processing Systems 35. External Links: 2105.14944, Link Cited by: §1.
  • [34] N. Papernot and P. McDaniel (2018) Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning. arXiv. External Links: 1803.04765, Link Cited by: §1, §3.1.
  • [35] A. Rai (2019) Explainable AI: from black box to glass box. Journal of the Academy of Marketing Science 2019 48:1 48 (1), pp. 137–141. External Links: Document, ISSN 1552-7824, Link Cited by: §1.
  • [36] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) "Why should i trust you?" Explaining the predictions of any classifier. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. External Links: Document, 1602.04938, ISBN 9781450342322, Link Cited by: §1.
  • [37] L. Shapley (1953) A value for n-person games. Contributions to the Theory of Games 2 (28), pp. 307–317. Cited by: §1.
  • [38] M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic Attribution for Deep Networks. 34th International Conference on Machine Learning, ICML 2017 7, pp. 5109–5118. External Links: 1703.01365, Link Cited by: §A.3, §1, §2.3, Remark 2.4.
  • [39] E. Tjoa and C. Guan (2020) A Survey on Explainable Artificial Intelligence (XAI): Toward Medical XAI. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–21. External Links: Document, 1907.07374, ISSN 2162-237X Cited by: §1.
  • [40] P. C. UK (2019) CUTRACT. Note: www.prostatecanceruk.org Cited by: §B.2, §3.3.
  • [41] E. Wu, K. Wu, R. Daneshjou, D. Ouyang, D. E. Ho, and J. Zou (2021) How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals. Nature Medicine. External Links: Document, Link Cited by: §3.3.
  • [42] C. Yeh, J. S. Kim, I. E. H. Yen, and P. Ravikumar (2018) Representer Point Selection for Explaining Deep Neural Networks. Advances in Neural Information Processing Systems, pp. 9291–9301. External Links: 1811.09720, Link Cited by: §B.1, §1, §3.1.

Appendix A Supplement for Mathematical Formulation

In this supplementary section, we give more details on the mathematical aspects underlying SimplEx.

A.1 Precision of the corpus approximation in output space

If the corpus representation of a test example has a non-vanishing residual, Assumption 2.1 controls the error between the black-box prediction for the test example and the prediction for its corpus representation.

Proposition A.1 (Precision in output space).

Consider a latent representation h with its corpus residual. If Assumption 2.1 holds, the corpus prediction approximates the black-box prediction with a precision controlled by the corpus residual:

where the bound involves a norm on the output space and the usual operator norm.


The proof is immediate:

where we have successively used the linearity of l, the definition of the operator norm and Definition 2.2. ∎

A.2 Uniqueness of corpus decomposition

As we have mentioned in the main paper, the corpus decomposition is not always unique. To illustrate, consider a corpus representation in which one vector is itself a convex combination of two others. A vector in the corpus hull can then be written as more than one mixture of corpus vectors; in other words, it admits more than one corpus decomposition. This is not a surprise for the attentive reader: the corpus representation is somewhat redundant, since one of its vectors is a combination of the others. The multiplicity of the corpus decomposition results from a redundancy in the corpus representation.

To make this reasoning more general, we need to revisit some classic concepts of convex analysis. To establish a sufficient condition that guarantees the uniqueness of corpus decompositions, we recall the definition of affine independence.

Definition A.1 (Affine independence).

The vectors h^1, …, h^C are affinely independent if the only coefficients λ_1, …, λ_C with ∑_{c=1}^C λ_c h^c = 0 and ∑_{c=1}^C λ_c = 0 are λ_1 = ⋯ = λ_C = 0.

If a set of vectors is not affinely independent, it means that one of the vectors can be written as an affine combination of the others. This is precisely what we called a redundancy in the previous paragraph. We now adapt a well-known result of convex analysis to our formalism:

Proposition A.2 (Uniqueness of corpus decomposition).

If the corpus representation is a set of affinely independent vectors, then every vector in the corpus hull admits one unique corpus decomposition.


The existence of a decomposition is a trivial consequence of the definition of the corpus hull. We prove the uniqueness of the decomposition by contradiction. Let us assume that a vector admits two distinct corpus decompositions:

where both sets of weights are nonnegative, sum to one, and differ from each other. Subtracting the two decompositions yields affine coefficients that sum to zero without all vanishing. It follows that the corpus representation is not affinely independent, a contradiction. ∎

This shows that affine independence provides a sufficient condition to ensure the uniqueness of corpus decompositions. If one wants to produce such a corpus, a possibility is to gradually add new examples to the corpus, checking that the latent representation of each new example is not an affine combination of the previous latent representations. Clearly, the number of examples in such a corpus cannot exceed the dimension of the latent space plus one.
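This incremental construction can be supported by a simple rank test (our sketch, not the authors' implementation): the latent vectors are affinely independent exactly when their differences to the first vector are linearly independent.

```python
import numpy as np

def affinely_independent(H, tol=1e-8):
    """Rows of H are affinely independent iff the differences h^c - h^1
    (c = 2, ..., C) form a linearly independent set, i.e. have rank C - 1."""
    if H.shape[0] <= 1:
        return True
    diffs = H[1:] - H[0]
    return np.linalg.matrix_rank(diffs, tol=tol) == H.shape[0] - 1
```

In particular, any corpus whose size exceeds the latent dimension plus one necessarily fails this test, matching the bound above.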

A.3 Integrated Jacobian and Integrated Gradients

Integrated Gradients is a well-known feature saliency method [38]. It uses a black-box output to attribute a saliency score to each feature. In the original paper, the output space is assumed to be one-dimensional; we shall therefore relax the bold notation that we have used for the outputs and the latent-to-output map so far. Although the original paper makes no mention of corpus decompositions, it is straightforward to adapt the definition of Integrated Gradients to our set-up:

Definition A.2 (Integrated Gradient).

The Integrated Gradient between a baseline and a corpus example, associated to a given feature, is obtained by integrating the gradient of the output with respect to that feature along the straight line running from the baseline to the corpus example, and rescaling by the corresponding feature shift.
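For reference, the standard Integrated Gradients formula of [38], adapted here to a baseline and a corpus example (the symbols x̄, x^c and f are our notational choices, not taken from the original), would read:

```latex
\mathrm{IG}_i(x^c) = \left(x^c_i - \bar{x}_i\right)
  \int_0^1 \frac{\partial f}{\partial x_i}\bigl(\gamma(t)\bigr)\,\mathrm{d}t,
\qquad \gamma(t) = \bar{x} + t\left(x^c - \bar{x}\right), \quad t \in [0,1],
```

where f denotes the scalar black-box output and i indexes the features.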

In the main paper, we have introduced Integrated Jacobians: a latent space generalization of Integrated Gradients. We use the word generalization for a reason: the Integrated Gradient can be deduced from the Integrated Jacobian, but not the opposite (except in the degenerate case where the latent space is one-dimensional; this case is of little interest, as it describes a situation where the latent and output spaces are isomorphic, so the distinction between them is fictitious). We make the relationship between the two quantities explicit in the following proposition.

Proposition A.3.

The Integrated Gradient can be deduced from the Integrated Jacobian via


We start from the definition of the Integrated Gradient:

where we have successively used: Assumption 2.1, the linearity of the partial derivative, the linearity of the integration operator, the linearity of the latent-to-output map, and the definition of Integrated Jacobians. ∎

Note that Integrated Jacobians allow us to push our understanding of the black-box beyond the output. There is very little reason to expect a one-dimensional output to capture the model’s complexity. As we have argued in the introduction, our paper pursues the more challenging ambition of gaining a deeper understanding of the black-box latent space.

A.4 Properties of Integrated Jacobians

We give a proof for the proposition appearing in the main paper.

Proposition 2.1 (Properties of Integrated Jacobians).

Consider a baseline and a test example together with their latent representations. If the latent shift between them admits a decomposition (2), the following properties hold.


Let us begin by proving (A). By using the chain rule for a given corpus example, we write explicitly the derivative of the curve with respect to its parameter:

where we used the definition of the curve to obtain the second equality. We use this equation to rewrite the sum of the Integrated Jacobians for this corpus example:

where we have successively used: the linearity of integration, the explicit expression for the curve derivative, the fundamental theorem of calculus and the definition of the curve . We are now ready to derive (A):

where we have successively used the exact expression for the sum of Integrated Jacobians associated to each corpus example and the definition of the corpus decomposition of h. This proves (A); let us now prove (B). We simply project both members of (A) on the overall shift. Projecting the left-hand side of (A) yields:

where we used the linearity of the projection operator. Projecting the right-hand side of (A) yields:

By equating the projected version of both members of (A), we deduce (B). ∎

A.5 Pseudocode for SimplEx

We give the pseudocode for the two modules underlying SimplEx: the corpus decomposition (Algorithm 1) and the evaluation of projected Jacobians (Algorithm 2).

Input: Test latent representation h ; Corpus representation {h^c}_{c=1}^C
Result: Weights of the corpus decomposition {w_c}_{c=1}^C ; Corpus residual
Initialize pre-weights: a ← 0 ;
while optimizing do
       Normalize pre-weights: w ← softmax(a) ;
       Evaluate loss: ‖h − ∑_c w_c h^c‖² ;
       Update pre-weights a with an Adam step ;
end while
Return normalized weights: w ← softmax(a) ;
Return corpus residual: ‖h − ∑_c w_c h^c‖ ;
Algorithm 1 SimplEx: Corpus Decomposition

where we used a vector notation for the pre-weights and the weights. For the Adam optimizer, we use the default hyperparameters of the PyTorch implementation. Note that this algorithm is a standard optimization loop for a convex problem, where the normalization of the weights is ensured by using a softmax.
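Algorithm 1 can be sketched in a few lines; the illustration below (ours, not the reference implementation) replaces the Adam update with plain gradient descent on the pre-weights and differentiates through the softmax by hand:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def corpus_decomposition(h, H, n_steps=2000, lr=0.5):
    """Approximate h as a convex mixture of the corpus latents (rows of H).

    Minimizes ||h - sum_c w_c h^c||^2 with w = softmax(a), so the weights
    stay on the simplex throughout the optimization."""
    a = np.zeros(H.shape[0])           # pre-weights
    for _ in range(n_steps):
        w = softmax(a)
        r = h - H.T @ w                # reconstruction error
        g_w = -2.0 * (H @ r)           # dL/dw
        a -= lr * w * (g_w - w @ g_w)  # chain rule through the softmax
    w = softmax(a)
    return w, np.linalg.norm(h - H.T @ w)   # weights, corpus residual
```

With an affinely independent two-example corpus, the recovered weights match the mixture generating the test latent and the residual vanishes.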

When the number of corpus examples allowed to contribute has to be limited, we use a strategy similar to the one used to produce extremal perturbations [18, 12]. This consists in adding a regularization term to the optimized loss, built with the vecsort operator: a permutation operator that sorts the components of a vector in ascending order. This term penalizes the smallest weights of the corpus decomposition and thus imposes sparsity, so that the optimal corpus decomposition only involves a limited number of non-vanishing weights. We now focus on the evaluation of the Projected Jacobian.
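One simple instantiation of such a penalty (our sketch; the paper's exact regularizer may differ) squares the smallest sorted weights, so that only a chosen number of weights can remain non-vanishing at the optimum:

```python
def sparsity_penalty(weights, n_keep):
    """Penalize all but the n_keep largest weights: vecsort the weights in
    ascending order and sum the squares of the smallest C - n_keep entries."""
    ws = sorted(weights)                     # vecsort: ascending order
    return sum(w * w for w in ws[: len(ws) - n_keep])
```

The penalty is zero exactly when at most n_keep weights are non-vanishing, which is the desired sparsity pattern.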

Input: Test input x ; Test representation h ; Corpus inputs ; Corpus representation ; Baseline input ; Baseline representation ; Black-box latent map g ; Number of bins T
Result: Jacobian projections
Initialize the Jacobian projections to zero ;
Form a matrix X of corpus inputs ;
Form a matrix X̄ of baseline inputs ;
for t = 1 to T do
       Set the evaluation input: X_t ← X̄ + (t/T) · (X − X̄) ;
       Increment the Jacobian projections with the input gradient of the projected latent shift evaluated at X_t ;
end for
Apply the appropriate pre-factor: multiply element-wise by (X − X̄)/T ;
Algorithm 2 SimplEx: Projected Jacobian

This algorithm approximates the integral involved in the definition of the Projected Jacobian with a standard Riemann sum. Note that the baseline vector is broadcast along the first dimension of the baseline matrix: each of its rows is a copy of the baseline input. Also note that the projected Jacobians can be computed in parallel with packages such as PyTorch’s autograd. We have used the conventional Hadamard (element-wise) product for the pre-factor. In our implementation, the number of bins is fixed; larger values do not significantly improve the precision of the Riemann sum.
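The Riemann-sum recipe of Algorithm 2 can be checked on a toy case where the projected latent map is linear, so the integral is exact (the helper names below are ours):

```python
import numpy as np

def projected_jacobian(x, x_base, grad_g_proj, n_bins=100):
    """Approximate the feature-wise Integrated-Jacobian projections with a
    Riemann sum along the line from the baseline x_base to the input x.

    grad_g_proj(x) returns the input gradient of the latent map already
    projected on the chosen latent direction."""
    total = np.zeros_like(x)
    for t in range(1, n_bins + 1):
        xt = x_base + (t / n_bins) * (x - x_base)   # point on the line
        total += grad_g_proj(xt)
    return (x - x_base) * total / n_bins            # Hadamard pre-factor

# Linear projected map: the gradient is constant, so the Riemann sum is exact
# and the attributions sum to the projected latent shift (completeness).
x, x_base = np.array([1.0, 2.0]), np.zeros(2)
attrib = projected_jacobian(x, x_base, lambda z: np.ones_like(z))
```

For a non-linear latent map, the sum only approximates the integral, which is why the number of bins appears as a hyperparameter.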

A.6 Choice of a baseline for Integrated Jacobians

Throughout our analysis of the corpus decomposition, we have assumed the existence of a baseline. This baseline is crucial, as it defines the starting point of the line along which we compute the Jacobian quantities. What is a good choice for the baseline? The answer to this question depends on the domain. When this makes sense, we choose the baseline to be an instance that does not contain any information. A good example of this is the baseline that we use for MNIST: a completely black image. Sometimes, this absence of information is not well-defined. A good example is the prostate cancer experiment: it makes little sense to define a patient whose features contain no information. In this set-up, our baseline is a patient whose features are fixed to their average value in the training set (this average could also be computed with respect to the corpus itself).