Shapley-based explainability on the data manifold

by Christopher Frye et al.

Explainability in machine learning is crucial for iterative model development, compliance with regulation, and providing operational nuance to model predictions. Shapley values provide a general framework for explainability by attributing a model's output prediction to its input features in a mathematically principled and model-agnostic way. However, practical implementations of the Shapley framework make an untenable assumption: that the model's input features are uncorrelated. In this work, we articulate the dangers of this assumption and introduce two solutions for computing Shapley explanations that respect the data manifold. One solution, based on generative modelling, provides flexible access to on-manifold data imputations, while the other directly learns the Shapley value function in a supervised way, providing performance and stability at the cost of flexibility. While the commonly used “off-manifold” Shapley values can (i) break symmetries in the data, (ii) give rise to misleading wrong-sign explanations, and (iii) lead to uninterpretable explanations in high-dimensional data, our approach to on-manifold explainability demonstrably overcomes each of these problems.




1 Introduction

AI’s potential to improve economic productivity is driven by its ability to significantly reduce the cost of predictions Agrawal et al. (2018). For these predictions to be beneficial, they must be largely correct, operationally consumable, and must not lead to unexpected systemic harm. The ability to explain how AI models make their predictions is a critical step towards this goal. The discipline of AI explainability is thus central to the practical impact of AI on society.

One could conservatively demand that only simple, by-construction-interpretable models are used for predictions that meaningfully impact people’s lives Rudin (2019). Such an approach, however, sacrifices the performance upside of complex, non-interpretable models. This motivates the study of post-hoc AI explainability, where the goal is to explain arbitrarily complex models.

Further distinction exists between model-specific and model-agnostic explainability. Model-specific methods explain a model’s predictions by referencing its internal structure; see e.g. Chen and Guestrin (2016) or Shrikumar et al. (2017). Model-agnostic methods explain predictions through input-output attribution, treating the model as a black box. Not only do model-agnostic methods offer general applicability, but they also provide a common language for explainability that does not require expert knowledge of the model.

Within the paradigm of post-hoc, model-agnostic explainability, a number of methods are used in practice. Many measure the effect of varying features on model performance Breiman (2001); Strobl et al. (2008) or on an individual prediction Baehrens et al. (2010). Another method fits an interpretable model to the original model around the point of prediction to garner local understanding Ribeiro et al. (2016). However, these methods are largely ad hoc and founded on prohibitively stringent assumptions, e.g. feature independence or linearity.

Fortunately, the general problem of attribution, of which model-agnostic explainability is an example, has been extensively developed in cooperative game theory. Shapley values Shapley (1953) provide the unique attribution method satisfying 4 intuitive axioms: they capture all interactions between features, they sum to the model prediction, and their linearity enables aggregation without loss of theoretical control. Shapley-based AI explainability has matured over the last two decades Lipovetsky and Conklin (2001); Kononenko and others (2010); Štrumbelj and Kononenko (2014); Datta et al. (2016); Lundberg and Lee (2017).

However, Shapley values suffer from a problematic assumption: they involve marginalisation over subsets of features, generally achieved by splicing data points together and thus evaluating the model on highly unrealistic data (e.g. Fig. 1). While such splicing is common in model-agnostic methods, it is only justified if all the data’s features are independent, an assumption almost never satisfied; otherwise, such spliced data lies off the data manifold. While work has been done towards remedying this flaw Aas et al. (2019); Rasouli and Yu (2019); Lundberg et al. (2020), a satisfactorily general and performant solution has yet to appear.

In this paper, we provide a detailed study of the off-manifold problem in explainability, and provide solutions to computing Shapley values on the data manifold. Our main contribution is the introduction of two new methods to compute on-manifold Shapley values for high-dimensional, multi-type data:

  1. a flexible generative-modelling technique to learn on-manifold conditional distributions;

  2. a robust supervised-learning technique that learns the on-manifold value function directly.

In Sec. 2, we provide precise definitions of key quantities in Shapley explainability, including global Shapley values, which to our knowledge have not been introduced elsewhere. In Sec. 3, we elucidate the conceptual difference between off- and on-manifold explanations and the marked drawbacks of off-manifold approaches. After presenting our solutions in Sec. 4, we demonstrate the practical effectiveness of on-manifold explainability with varied experiments in Sec. 5.

2 Shapley values on the data manifold

Here we review the Shapley framework for model explainability, define on-manifold Shapley values precisely, and introduce global explanations that obey the Shapley axioms.

2.1 Shapley values for model explainability

In cooperative game theory, a team of players work together to earn value von Neumann and Morgenstern (1944). Given a value function v(S) indicating the value that a coalition S of players would earn on their own, the Shapley value provides a principled approach to distributing credit for the total earnings among the players Shapley (1953):

\[
\phi_v(i) \;=\; \frac{1}{|N|!} \sum_{\pi} \Big[ v\big(S_\pi^i \cup \{i\}\big) - v\big(S_\pi^i\big) \Big] \tag{1}
\]

where the sum runs over all orderings π of the |N| players and S_π^i denotes the players preceding player i in ordering π. The Shapley value thus computes player i’s marginal value-added upon joining the team, averaged over all orderings in which the team can be constructed.
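The average over orderings in Eq. (1) can be estimated by Monte Carlo sampling of permutations. The following is a minimal illustrative sketch, not the authors’ implementation; the function names are ours:

```python
import random

def shapley_values(value_fn, n_features, n_perms=2000, seed=0):
    """Monte Carlo Shapley values: average each feature's marginal
    contribution to the value function over random feature orderings."""
    rng = random.Random(seed)
    phi = [0.0] * n_features
    order = list(range(n_features))
    for _ in range(n_perms):
        rng.shuffle(order)
        coalition = set()
        v_prev = value_fn(frozenset(coalition))
        for i in order:
            coalition.add(i)
            v_new = value_fn(frozenset(coalition))
            phi[i] += v_new - v_prev
            v_prev = v_new
    return [p / n_perms for p in phi]
```

For an additive value function v(S) = Σ_{i∈S} w_i the estimator recovers the weights exactly, and for any value function the marginal contributions within each sampled ordering telescope, so the estimated values always sum exactly to v(N) − v(∅), as the Efficiency axiom requires.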

In the context of supervised learning, let f_y(x) represent a model’s predicted probability that data point x belongs to class y, so that Σ_y f_y(x) = 1. To apply Shapley attribution to model explainability, one interprets the input features x_i as players in a game and the model output as their earned value. To compute the Shapley value of each feature x_i, one must define a value function v_y(S; x) to represent the outcome of the model on a restricted coalition of inputs x_S = {x_i : i ∈ S}.

While the value function should act as a proxy for “f_y(x_S)”, the model is undefined given only partial input x_S, so one cannot leave out-of-coalition slots empty. In the standard treatment Lundberg and Lee (2017), one averages over out-of-coalition features x_{\bar S}, where \bar S = N \ S, drawn unconditionally from the data:

\[
v_y(S; x) \;=\; \mathbb{E}_{p(x')} \big[ f_y(x_S \cup x'_{\bar S}) \big] \tag{2}
\]

We refer to this value function, and the corresponding Shapley values φ_y(i; x), as lying off-manifold, since the splices x_S ∪ x'_{\bar S} generically lie far from the data manifold (e.g. Fig. 1). Even so, the Shapley framework guarantees that model explanations satisfy an intuitive set of properties Shapley (1953):
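Concretely, Eq. (2) amounts to splicing the in-coalition features of x into background data points drawn unconditionally and averaging the model output. A sketch in our notation (`background` is a matrix of unconditionally drawn data points):

```python
import numpy as np

def off_manifold_value(model, x, coalition, background):
    """Off-manifold value function (Eq. 2): impute out-of-coalition
    features by splicing in values from unconditionally drawn data
    points, then average the model output over the splices."""
    spliced = background.copy()
    idx = sorted(coalition)
    spliced[:, idx] = x[idx]          # keep in-coalition features from x
    return float(model(spliced).mean())
```

When the features are correlated, most rows of `spliced` are unrealistic inputs of exactly the kind shown in Fig. 1.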


  • Efficiency. Shapley values distribute the model prediction fully among the features, up to an offset term (not attributed to any feature) representing the average probability f assigns to class y:

\[
\sum_i \phi_y(i; x) \;=\; f_y(x) \;-\; \mathbb{E}_{p(x')} \big[ f_y(x') \big] \tag{3}
\]

  • Linearity. Shapley values aggregate linearly in a linear-ensemble model.

  • Nullity. Features that do not influence the value function receive zero Shapley value.

  • Symmetry. Features that influence the value function identically receive equal Shapley values.

Figure 1: An MNIST digit, a coalition of pixels in a Shapley calculation, and 5 off-manifold splices.

2.2 On-manifold Shapley values

In practice, Shapley explanations are widely based on the off-manifold value function, Eq. (2), which evaluates the model on splices, x_S ∪ x'_{\bar S}, with x'_{\bar S} drawn independently of x_S. Splicing features from unrelated data points generically leads to unrealistic model inputs. Such unrealistic splices lie outside the model’s regime of validity, where there is no reason to expect controlled model behaviour. Off-manifold explanations thus obfuscate insights into the model’s behaviour on real data.

To fix the off-manifold problem, one should condition out-of-coalition features x_{\bar S} on in-coalition features x_S, thus basing Shapley explanations on an on-manifold value function:

\[
v_y(S; x) \;=\; \mathbb{E}_{p(x'_{\bar S} \mid x_S)} \big[ f_y(x_S \cup x'_{\bar S}) \big] \tag{4}
\]
Note that, since the Nullity and Symmetry axioms reference the value function directly, these properties will manifest differently off- and on-manifold; see e.g. Secs. 3.1 and 5.2.

Preference for an on-manifold value function is widely acknowledged Lundberg and Lee (2017); Hooker and Mentch (2019); Mase et al. (2019). However, the requisite conditional distribution p(x_{\bar S} | x_S) is not empirically accessible in practical scenarios with high-dimensional data or features that take many (e.g. continuous) values. A performant method to estimate the on-manifold value function has until now been lacking and is the focus of this work.
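For the low-dimensional discrete examples considered below, the conditional expectation in Eq. (4) can be taken directly over the empirical distribution: restrict to data points that agree with x on the in-coalition features and average the model over them. A sketch under that assumption (the function name is ours):

```python
import numpy as np

def on_manifold_value_empirical(model, x, coalition, data):
    """On-manifold value function (Eq. 4) from the empirical conditional
    distribution: average the model over data points that agree with x
    on the coalition. Feasible only for low-dimensional discrete data,
    where exact matches exist."""
    idx = sorted(coalition)
    match = np.all(data[:, idx] == x[idx], axis=1)
    conditioned = data[match].copy()
    if len(conditioned) == 0:
        raise ValueError("no data points match the coalition values")
    conditioned[:, idx] = x[idx]  # in-coalition features fixed to x
    return float(model(conditioned).mean())
```

Note that with strongly correlated features, the conditioned average can depend on a feature even when the model does not use it directly; this is the informational dependence discussed in Sec. 3.1.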

2.3 Global Shapley values

As presented above, Shapley values provide a method for local explainability, explaining the prediction f_y(x) on an individual data point x. For a global understanding of model behaviour, one might average φ_y(i; x) over the data, with the class-of-interest y fixed. However, for an important feature x_i, its local Shapley value can vary between large-positive and large-negative values, as x_i may correlate with the class of interest in some regions of the data and anti-correlate in others. As this would lead to large cancellations, it is common to average the absolute value instead Lundberg et al. (2020). However, such a nonlinear aggregation leads to a global explanation that breaks the axioms underlying the Shapley framework.

To both preserve the Shapley axioms and avoid large cancellations, we define global Shapley values:

\[
\Phi_i \;=\; \mathbb{E}_{p(x, y)} \big[ \phi_y(i; x) \big] \tag{5}
\]

where p(x, y) is the distribution from which the labelled data is drawn, and – crucially – the class y varies with the data point x in the average. Global Shapley values obey a sum rule that follows from Eq. (3):

\[
\sum_i \Phi_i \;=\; \mathbb{E}_{p(x, y)} \big[ f_y(x) \big] \;-\; \mathbb{E}_{p(y)} \, \mathbb{E}_{p(x')} \big[ f_y(x') \big] \tag{6}
\]

One can thus interpret the global Shapley value as the portion of the model’s accuracy attributable to the feature. Indeed, the first term in Eq. (6) is the accuracy one achieves by drawing labels from f’s predicted probability distribution over classes. The offset term is the accuracy one achieves using none of the features: drawing the label of x from the model’s output on a random input x'.
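Eqs. (5) and (6) can be checked on a toy problem. The sketch below is ours, with a hypothetical two-feature model and the off-manifold value function of Eq. (2); the global values average local values with the class of interest set to each point’s true label, and their sum matches the accuracy decomposition of Eq. (6):

```python
import itertools, math

# toy labelled data: two binary features; hypothetical model whose
# class-1 probability depends only on feature 0
DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 1), ((1, 1), 1)]

def f(x, label):
    p1 = 0.1 + 0.8 * x[0]
    return p1 if label == 1 else 1.0 - p1

def value(S, x, label):
    """Off-manifold value function (Eq. 2): splice in-coalition features
    of x into unconditionally drawn points, average the model output."""
    total = 0.0
    for xp, _ in DATA:
        spliced = tuple(x[i] if i in S else xp[i] for i in range(len(x)))
        total += f(spliced, label)
    return total / len(DATA)

def local_shapley(x, label):
    """Exact Shapley values (Eq. 1) by enumerating all orderings."""
    n = len(x)
    phi = [0.0] * n
    for order in itertools.permutations(range(n)):
        S = set()
        v_prev = value(S, x, label)
        for i in order:
            S.add(i)
            v_new = value(S, x, label)
            phi[i] += v_new - v_prev
            v_prev = v_new
    return [p / math.factorial(n) for p in phi]

# global Shapley values (Eq. 5): class of interest = each point's true label
Phi = [sum(local_shapley(x, y)[i] for x, y in DATA) / len(DATA)
       for i in range(2)]

# sum rule (Eq. 6): total equals mean f_y(x) minus the no-feature baseline
lhs = sum(Phi)
rhs = (sum(f(x, y) for x, y in DATA) / len(DATA)
       - sum(value(set(), x, y) for x, y in DATA) / len(DATA))
```

Because each ordering’s marginal contributions telescope, the sum rule holds exactly, not just in expectation.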

3 Off- versus on-manifold Shapley values

This section articulates key differences between model explanations off versus on the data manifold.

3.1 Functional versus informational dependence

Figure 2: Explaining a shallow decision tree (a & b) and a random forest (c) on Drug Consumption data.

The only effect in-coalition features x_S have in the off-manifold value function, Eq. (2), is through their role as direct model inputs, f(x_S ∪ x'_{\bar S}). It follows that if f does not have explicit functional dependence on feature x_i, then the off-manifold Shapley value of x_i vanishes.

By contrast, in-coalition features affect the on-manifold value function, Eq. (4), through a second channel: implicitly, through p(x_{\bar S} | x_S)’s dependence on x_i when x_i and x_{\bar S} correlate. The on-manifold Shapley value can thus be nonzero even for a feature x_i that does not act upon f directly. In such a case, the model does use information in x_i, but extracts it via other features.

To demonstrate this on the Drug Consumption data from the UCI repository Dua and Graff (2017), we used the 10 binary features listed in Fig. 2 (Mushrooms, Ecstasy, etc.) to predict whether individuals had consumed an 11th drug: LSD. We modelled this data using a shallow decision tree f with only 3 nodes.

Fig. 2(a) shows local explanations of the decision tree’s prediction “LSD = True” for a test-set individual with features listed on the horizontal axis. The explanations are computed using Monte Carlo approximations to Eq. (1), with value functions approximated using the empirical distribution (accessible in this case, with just 10 binary features). Note that the off-manifold Shapley values are nonzero only for the 3 features that f depends on explicitly, while the on-manifold explanation indicates f’s implicit dependence on information contained in all features.

Fig. 2(b) shows global Shapley values for this shallow decision tree. These global explanations are the expectation values of local explanations, as in Eq. (5). Note that all on-manifold global Shapley values are non-negative, consistent with their interpretation as the portion of model accuracy attributable to the information contained in each feature.

3.2 The garbage-in, garbage-out problem

While Sec. 3.1 might lead one to believe that off-manifold explainability provides useful insight into the functional dependence of a model, it is a perilously uncontrolled approach, especially for complex nonlinear models such as neural networks. Indeed, it is widely known that machine learning models are not robust to distributional shift Nguyen et al. (2015); Goodfellow et al. (2015). Still, the off-manifold value function of Eq. (2) evaluates the model outside its domain of validity, where it is untrained and potentially wildly misbehaved, in the hope that an aggregation of such evaluations will be meaningful. This garbage-in, garbage-out problem is the clearest reason to avoid off-manifold Shapley values.

Since this point is understood in the literature Hooker and Mentch (2019); Sundararajan and Najmi (2019), we simply provide an example of this problem in Fig. 1, which shows an example binary MNIST digit LeCun and Cortes (2010), a coalition of pixels, and 5 random splices that would be used in an off-manifold explanation.

3.3 Misleading explanations off manifold

Figure 3: Local and global explanations of decision tree fit to simple synthetic data set.

To demonstrate that off-manifold Shapley values can be misleading in practice, we generated synthetic data according to the process in Fig. 3(a). The data has two binary features, x_1 and x_2, and a binary label y, all class-balanced. We fit a decision tree to this data: a precise match to Fig. 3(a).

Note that the features x_1 and x_2 are positively correlated, both with each other and with the label y. However, with x_1 fixed, the likelihood of y = 1 decreases slightly from x_2 = 0 to x_2 = 1. One might think of x_1 as disease severity, x_2 as treatment intensity, and y as mortality rate.

The local Shapley values for the frequent scenario x_1 = x_2 = 1 are plotted in Fig. 3(b). We find the negative off-manifold Shapley value shown for x_2 to be misleading, as it would suggest that the observation x_2 = 1 is more commonly associated with a prediction of y = 0 rather than the true label y = 1. The negative value of φ(x_2) is due to the model’s decreased confidence in y = 1 as one goes from x_2 = 0 to x_2 = 1, at either value of x_1. This misleading sign is therefore due to the value function’s heavy sensitivity to the model’s behaviour when x_1 and x_2 disagree, despite such combinations being exceedingly rare in the data.

Fig. 3(c) displays global Shapley values for this model. Note that the on-manifold global values are positive, consistent with their interpretation as the portion of model accuracy attributable to each feature. However, there is a negative off-manifold global value that results from aggregating wrong-sign local explanations. Such a negative value would indicate that the corresponding input is actually detrimental to the model’s overall performance, which is of course not the case.

3.4 On-manifold Shapley in the non-parametric limit

Here we present a result that strengthens the connection between on-manifold Shapley values and the data distribution: in the limit of a perfect model of the data, on-manifold Shapley values converge to an explanation of how the information in the data associates with the labelled outcomes.

To show why this holds, suppose the predicted probability f_y(x) converges to the true underlying distribution p(y | x). In this non-parametric limit, the on-manifold value function of Eq. (4) becomes

\[
v_y(S; x) \;=\; \mathbb{E}_{p(x'_{\bar S} \mid x_S)} \big[ p(y \mid x_S \cup x'_{\bar S}) \big] \;=\; p(y \mid x_S) \tag{7}
\]

in which case on-manifold value is attributed to x_S based on x_S’s predictivity of the label y.

To demonstrate this empirically, we fit a random forest to the Drug Consumption data and plotted its off- and on-manifold global Shapley values in Fig. 2(c). Next we fit a separate random forest to each coalition of features, 2^10 models in total, as in Štrumbelj et al. (2009). We used the accuracy of each model – in the sense of Sec. 2.3 – as the value function for an additional Shapley computation:

\[
\hat{\Phi}_i \;=\; \frac{1}{|N|!} \sum_{\pi} \Big[ \mathrm{acc}\big(S_\pi^i \cup \{i\}\big) - \mathrm{acc}\big(S_\pi^i\big) \Big] \tag{8}
\]

where acc(S) is the accuracy of the model trained on feature set S, so that \hat{\Phi}_i is directly the average gain in accuracy that results from adding feature i to the set of inputs. These values are labelled “Retrained models” in Fig. 2(c). Note their agreement with the on-manifold explanation of the fixed model f. On-manifold Shapley values thus indicate which features in the data are most predictive of the label.
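The retrained-models check of Eq. (8) can be reproduced in miniature: take as the value of each coalition the accuracy of a simple classifier retrained on just those features (a lookup-table classifier below, standing in for the retrained random forests), and run the Shapley computation over it. A sketch with our function names:

```python
import itertools, math
from collections import Counter

def coalition_accuracy(X, y, S):
    """Value of a coalition S: accuracy of a classifier retrained on the
    features in S (a lookup table predicting the majority class for each
    observed pattern of in-coalition features)."""
    groups = {}
    for row, label in zip(X, y):
        key = tuple(row[i] for i in sorted(S))
        groups.setdefault(key, []).append(label)
    correct = sum(max(Counter(labels).values()) for labels in groups.values())
    return correct / len(y)

def retrained_shapley(X, y):
    """Exact Shapley values of the retrained-model value function."""
    n = len(X[0])
    phi = [0.0] * n
    for order in itertools.permutations(range(n)):
        S = set()
        v_prev = coalition_accuracy(X, y, S)
        for i in order:
            S.add(i)
            v_new = coalition_accuracy(X, y, S)
            phi[i] += v_new - v_prev
            v_prev = v_new
    return [p / math.factorial(n) for p in phi]
```

As in the paper’s experiment, each value equals the average accuracy gained by adding that feature to the inputs; a feature carrying no information about the label receives zero.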

This consistency check allows us to show that Tree SHAP Lundberg et al. (2018, 2020) does not provide a method for on-manifold explainability. Observe in Fig. 2(c) that Tree SHAP values roughly track the off-manifold explanation: somewhat larger on the most predictive feature and somewhat smaller on the others. This occurs because trees tend to split on high-predictivity features first, and Tree SHAP privileges early-splitting features in an otherwise off-manifold calculation.

4 Scalable approaches to computing on-manifold Shapley values

For the results of Sec. 3, the on-manifold value function, Eq. (4), was estimated from the empirical data distribution, an approach which is not practical for complex, realistic data. In this section, we develop two methods of learning the on-manifold value function: (i) unsupervised learning of the conditional distribution p(x_{\bar S} | x_S), and (ii) a supervised technique to learn the value function directly.

4.1 Unsupervised approach

To take an unsupervised approach to the data manifold, one can learn the conditional distribution p(x_{\bar S} | x_S) that appears in the on-manifold value function. We do this using variational inference and two model components. The first component is a variational autoencoder Kingma and Welling (2014); Rezende et al. (2014), with encoder q(z | x) and decoder p(x | z). The second is a masked encoder, q(z | x_S), whose goal is to map the coalition x_S to a distribution in latent space that agrees with the full encoder as well as possible. A model of the conditional distribution is then provided by the composition:

\[
p(x_{\bar S} \mid x_S) \;=\; \int \mathrm{d}z \; p(x_{\bar S} \mid z) \, q(z \mid x_S) \tag{9}
\]

and a good fit to the data should maximise p(x_{\bar S} | x_S). A lower bound on its log-likelihood is given by

\[
\mathcal{L} \;=\; \mathbb{E}_{q(z \mid x)} \Big[ \log p(x_{\bar S} \mid z) + \log q(z \mid x_S) - \log q(z \mid x) \Big] \;\le\; \log p(x_{\bar S} \mid x_S) \tag{10}
\]

While \mathcal{L} could be used on its own as the objective function to learn p(x_{\bar S} | x_S), this would leave the variational distribution q(z | x) unconstrained, at odds with our goal of learning a smooth-manifold structure in latent space. This concern can be mitigated by introducing

\[
\mathcal{R} \;=\; \mathbb{E}_{q(z \mid x)} \Big[ \log q(z \mid x) - \log p(z) \Big] \tag{11}
\]

which regularises q(z | x) by penalising differences from a smooth (e.g. unit normal) prior distribution p(z). We thus include \mathcal{R} as a regularisation term in our unsupervised objective: \mathcal{L}_{unsup} = \mathcal{L} − β\mathcal{R}. This objective contains a hyperparameter β that prevents a fair comparison between models trained with different values. A separate metric to judge performance is discussed next.

4.2 Metric for the learnt value function

The unsupervised method of Sec. 4.1 leads to a learnt estimate \hat{p}(x_{\bar S} | x_S) of the conditional distribution, and thus to an estimate \hat{v}_y(S; x) of the on-manifold value function. With the goal of judging this estimate, consider the following formal quantity:

\[
\Delta_y(u; S, x_S) \;=\; \mathbb{E}_{p(x'_{\bar S} \mid x_S)} \Big[ \big( f_y(x_S \cup x'_{\bar S}) - u \big)^2 \Big] \tag{12}
\]

This quantity is minimal with respect to u when u = v_y(S; x), in agreement with the definition, Eq. (4), of the on-manifold value function. We can then quantitatively judge the performance of the unsupervised model by computing

\[
\mathrm{MSE}(\hat{v}) \;=\; \mathbb{E}_{p(S)} \, \mathbb{E}_{p(x)} \, \mathbb{E}_{y \sim \mathrm{Unif}} \Big[ \big( f_y(x) - \hat{v}_y(S; x) \big)^2 \Big] \tag{13}
\]

Note that Eq. (13) is precisely Eq. (12) averaged over coalitions S drawn from the Shapley sum, features x drawn from the data, and labels y drawn uniformly over classes. Moreover, the mean-square-error in Eq. (13) is easy to estimate using the empirical distribution and the learnt model \hat{v}, thus providing an unambiguous metric to judge the outcome of the unsupervised approach.
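Eq. (13) can be estimated by sampling coalitions (here, as random prefixes of random feature orderings, one simple choice matching the Shapley sum), data points from the data, and classes uniformly. A sketch with hypothetical `model` and `value_estimate` callables (our names):

```python
import numpy as np

def value_function_mse(model, value_estimate, X, n_classes,
                       n_samples=2000, seed=0):
    """Estimate the MSE metric of Eq. (13): squared difference between
    the model's output on a real data point and a candidate value
    function evaluated on a random coalition of that point's features."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    errs = []
    for _ in range(n_samples):
        x = X[rng.integers(len(X))]
        y = rng.integers(n_classes)                 # labels drawn uniformly
        order = rng.permutation(n)                  # random coalition: a
        S = frozenset(order[: rng.integers(n + 1)].tolist())  # random prefix
        errs.append((model(x, y) - value_estimate(S, x, y)) ** 2)
    return float(np.mean(errs))
```

A candidate closer to the true conditional mean scores a lower MSE, which is what makes the metric usable for model selection across values of the hyperparameter β.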

4.3 Supervised approach

The MSE metric of Eq. (13) supports a supervised approach: learning the on-manifold value function directly. We do this by defining a surrogate model g_y(x_S) that can operate on coalitions of features (e.g. by masking out-of-coalition features) and that is trained to minimise the objective:

\[
\mathbb{E}_{p(S)} \, \mathbb{E}_{p(x)} \, \mathbb{E}_{y \sim \mathrm{Unif}} \Big[ \big( f_y(x) - g_y(x_S) \big)^2 \Big] \tag{14}
\]

As discussed in Sec. 4.2, this objective is minimised as the surrogate model approaches the on-manifold value function of the model-to-be-explained.
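A minimal version of this supervised approach: generate (data point, random coalition) pairs, mask out-of-coalition features, and regress the model’s output on the masked input. The linear surrogate and closed-form least-squares fit below are stand-ins for the surrogate network and gradient training used in practice (all names are ours):

```python
import numpy as np

def fit_surrogate(model, X, n_samples=4000, seed=0):
    """Learn the on-manifold value function directly (Eq. 14): regress
    the model's output on masked inputs over random (data point,
    coalition) pairs. A linear surrogate on [x*m, m, 1] is used here,
    where m is the coalition mask."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    A, t = [], []
    for _ in range(n_samples):
        x = X[rng.integers(len(X))]
        m = np.zeros(n)
        order = rng.permutation(n)
        m[order[: rng.integers(n + 1)]] = 1.0   # random coalition mask
        A.append(np.concatenate([x * m, m, [1.0]]))
        t.append(model(x))                      # target: model on full input
    theta, *_ = np.linalg.lstsq(np.array(A), np.array(t), rcond=None)

    def surrogate(x, S):
        m = np.zeros(n)
        m[list(S)] = 1.0
        return float(np.concatenate([x * m, m, [1.0]]) @ theta)
    return surrogate
```

Because the squared loss is minimised by the conditional mean, the fitted surrogate approximates v(S; x) for each coalition: when a feature is masked out, the surrogate’s prediction falls back towards the average over the data consistent with the observed features.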

4.4 Comparison of approaches

Our implementations of the unsupervised and supervised approaches to on-manifold explainability are summarised in App. A. Both approaches lead to broadly similar results. Fig. 4(a) compares the two techniques on the Drug Consumption data, where explanations are given for the random forest of Sec. 3.4 and compared against the computation using the empirical distribution.

Figure 4: (a) Unsupervised and supervised techniques for on-manifold explainability compared on Drug Consumption data. Global Shapley values for (b) Abalone data and (c) Census Income data.

The unsupervised approach to on-manifold explainability is flexible but untargeted: the learnt conditional distribution p(x_{\bar S} | x_S) is data-set-specific but model-agnostic, accommodating explanations for many models trained on the same data. The supervised approach trades flexibility for increased performance: while the technique must be retrained to explain each model, it entails direct minimisation of the MSE.

The supervised method is thus expected to achieve higher accuracy. We confirmed this on all data sets studied in this paper; see Table 1 in App. A for a numerical comparison. The supervised approach also offers increased stability, leading to a smaller variance in MSE in repeated experiments (cf. Table 1). The supervised method is more efficient as well: while the unsupervised technique estimates the value function by sampling from the learnt conditional distribution, the supervised approach learns the value function directly. As a result, to compute Shapley values for the experiments of Sec. 5, the supervised method required sampling roughly 10 times fewer coalitions to match the standard error of the unsupervised method.

5 Experiments and results

Here we demonstrate the practical utility of on-manifold explainability through experiments. All numerical details, including a description of uncertainties, are given in App. B.

5.1 Abalone data

Global Shapley values represent the portion of a model’s accuracy attributable to each feature. To show that staying on manifold is required for this interpretation to be robust, we experimented on Abalone data from the UCI repository Dua and Graff (2017). We trained a neural network on the physical characteristics contained in the data to classify abalone as younger or older than the median age.

Fig. 4(b) displays global Shapley values for this model. While the supervised and unsupervised techniques lead to broadly similar on-manifold explanations, observe the drastic difference that arises off manifold. This is due to the tight correlations between features in the data (4 different weights and 3 lengths) making the data manifold low-dimensional and important.

Notice further that Fig. 4(b) displays negative global Shapley values off manifold, negating their interpretation as portions of the model accuracy attributable to each feature.

5.2 Census Income data

To demonstrate that on-manifold explanations are consistent with correlations that appear in the data, we experimented on UCI Census Income data Dua and Graff (2017). We trained an xgboost classifier Chen and Guestrin (2016) to predict whether an individual’s income exceeds $50k based on demographic features in the data.

Fig. 4(c) displays global Shapley values for this model, using the supervised method for the on-manifold explanation. Note the large discrepancy between the off-manifold Shapley values for marital-status and relationship. These features are strongly correlated (married individuals most often have relationship = husband or wife) and their roughly-equal on-manifold values indicate that these features are nearly identically predictive of the model’s output.

Notice further that the Shapley value for age is significantly larger off-manifold than on. This means that the model heavily relies on age to determine its output, but that age correlates with other features, e.g. marital-status and education, that are also predictive of the model’s output.

5.3 MNIST

Figure 5: (a) Randomly drawn MNIST digits explained on / off manifold. Red / blue pixels indicate positive / negative Shapley values, and the colour scale in each column is fixed. (b) Shapley summand as a function of coalition size – averaged over coalitions, pixels, and the MNIST test set.

To demonstrate on-manifold explainability on higher-dimensional data, we trained a simple feed-forward network on binary MNIST LeCun and Cortes (2010) and explained randomly drawn digits in Fig. 5(a).

Despite having the same sum over pixels – as controlled by Eq. (3) – and explaining the same model prediction, each on-manifold explanation is more concentrated, with more interpretable structure, than its off-manifold counterpart. The handwritten strokes are clearly visible on-manifold, with key off-stroke regions highlighted as well. Off-manifold explanations generally display lower intensities spread less informatively across the digit-region.

These off-manifold explanations are a direct result of splices as in Fig. 1. With such unrealistic input, the model’s output is uncontrolled and uninformative. In fact, it is only on very large coalitions of pixels, subject to minimal splicing, that the model can make intelligent predictions off-manifold. This is confirmed in Fig. 5(b), which shows the average Shapley summand as a function of coalition size on MNIST. Note that primarily large coalitions underpin off-manifold explanations, whereas far fewer pixels are required on-manifold, consistent with the low-dimensional manifold underlying the data.

6 Related work

Within the Shapley paradigm, initial work has been done to produce on-manifold explanations: Aas et al. (2019) (similar to Zintgraf et al. (2017); Gu and Tresp (2019)) explores empirical and distribution-fitting techniques, while Lundberg et al. (2018) takes a tree-specific approach, conditioning out-of-coalition features on in-coalition features appearing earlier in the tree. In contrast to these methods, we compute on-manifold Shapley values with more-scalable methods of learning the data manifold, either through variational inference or supervised learning. Moreover, we show in Fig. 2(c) that Tree SHAP does not remedy the off-manifold problem.

Other on-manifold explainability methods exist as well; see e.g. Chang et al. (2019) and Agarwal et al. (2019). Complementary to our work, these methods apply to images, lie outside the Shapley paradigm, and require generative methods. We focus on general data types, operate within the Shapley framework, and offer a simpler alternative (Sec. 4.3) to generative methods.

7 Conclusion

In this work, we undertook a careful study of the off-manifold problem in AI explainability. We presented the distinction between on- and off-manifold Shapley values in the conceptually clear setting of tree-based models and low-dimensional data. We then introduced two novel techniques to compute on-manifold Shapley values for any model on any data: one technique learns to impute features on the data manifold, while the other learns the Shapley value function directly. In so doing, we provided compelling evidence against the use of off-manifold explainability, and demonstrated that on-manifold Shapley values offer a viable approach to AI explainability in real-world contexts.


This work was developed and experiments were run on the Faculty Platform for machine learning. The authors benefited from discussions with Tom Begley, Markus Kunesch, and John Mansir. DDM was partially supported by UCL’s Centre for Doctoral Training in Data Intensive Science.


  • [1] K. Aas, M. Jullum, and A. Løland (2019) Explaining individual predictions when features are dependent: more accurate approximations to Shapley values. Note: [arXiv:1903.10464] Cited by: §1, §6.
  • [2] C. Agarwal, D. Schonfeld, and A. Nguyen (2019) Removing input features via a generative model to explain their attributions to classifier’s decisions. Note: [arXiv:1910.04256] Cited by: §6.
  • [3] A. Agrawal, J. Gans, and A. Goldfarb (2018) Prediction machines: the simple economics of artificial intelligence. Harvard Business Press. Cited by: §1.
  • [4] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller (2010) How to explain individual classification decisions. Journal of Machine Learning Research. Cited by: §1.
  • [5] L. Breiman (2001) Random forests. Machine learning. Cited by: §1.
  • [6] C. Chang, E. Creager, A. Goldenberg, and D. Duvenaud (2019) Explaining image classifiers by counterfactual generation. In International Conference on Learning Representations, Cited by: §6.
  • [7] T. Chen and C. Guestrin (2016) XGBoost: a scalable tree boosting system. In International Conference on Knowledge Discovery and Data Mining, Cited by: §B.3, §1, §5.2.
  • [8] A. Datta, S. Sen, and Y. Zick (2016) Algorithmic transparency via quantitative input influence: theory and experiments with learning systems. In IEEE Symposium on Security and Privacy, Cited by: §1.
  • [9] D. Dua and C. Graff (2017) UCI machine learning repository. Cited by: §B.1, §B.2, §B.3, §3.1, §5.1, §5.2.
  • [10] I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations, Cited by: §3.2.
  • [11] J. Gu and V. Tresp (2019) Contextual prediction difference analysis. Note: [arXiv:1910.09086] Cited by: §6.
  • [12] G. Hooker and L. Mentch (2019) Please stop permuting features: an explanation and alternatives. Note: [arXiv:1905.03151] Cited by: §2.2, §3.2.
  • [13] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations, Cited by: §B.1.
  • [14] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In International Conference on Learning Representations, Cited by: §4.1.
  • [15] I. Kononenko et al. (2010) An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research. Cited by: §1.
  • [16] Y. LeCun and C. Cortes (2010) MNIST database. Cited by: §B.4, §3.2, §5.3.
  • [17] S. Lipovetsky and M. Conklin (2001) Analysis of regression in game theory approach. Applied Stochastic Models in Business and Industry. Cited by: §1.
  • [18] S. M. Lundberg, G. Erion, H. Chen, A. DeGrave, J. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, and S. Lee (2020) From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence. Cited by: §1, §2.3, §3.4.
  • [19] S. M. Lundberg, G. G. Erion, and S.-I. Lee (2018) Consistent individualized feature attribution for tree ensembles. Note: [arXiv:1802.03888] Cited by: §3.4, §6.
  • [20] S. M. Lundberg and S.-I. Lee (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, Cited by: §B.1, §1, §2.1, §2.2.
  • [21] M. Mase, A. B. Owen, and B. Seiler (2019) Explaining black box decisions by Shapley cohort refinement. Note: [arXiv:1911.00467] Cited by: §2.2.
  • [22] A. Nguyen, J. Yosinski, and J. Clune (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.2.
  • [23] P. Rasouli and I. C. Yu (2019) Meaningful data sampling for a faithful local explanation method. In Intelligent Data Engineering and Automated Learning, Cited by: §1.
  • [24] D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, Cited by: §4.1.
  • [25] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) “Why should I trust you?”: explaining the predictions of any classifier. In International Conference on Knowledge Discovery and Data Mining, Cited by: §1.
  • [26] C. Rudin (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence. Cited by: §1.
  • [27] L. S. Shapley (1953) A value for n-person games. In Contributions to the Theory of Games, Cited by: §1, §2.1, §2.1.
  • [28] A. Shrikumar, P. Greenside, and A. Kundaje (2017) Learning important features through propagating activation differences. In International Conference on Machine Learning, Cited by: §1.
  • [29] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis (2008) Conditional variable importance for random forests. BMC Bioinformatics. Cited by: §1.
  • [30] E. Štrumbelj, I. Kononenko, and M. Robnik-Šikonja (2009) Explaining instance classifications with interactions of subsets of feature values. Data & Knowledge Engineering. Cited by: §3.4.
  • [31] E. Štrumbelj and I. Kononenko (2014) Explaining prediction models and individual predictions with feature contributions. Knowledge and information systems. Cited by: §1.
  • [32] M. Sundararajan and A. Najmi (2019) The many Shapley values for model explanation. Note: [arXiv:1908.08474] Cited by: §3.2.
  • [33] J. von Neumann and O. Morgenstern (1944) Theory of games and economic behavior. Princeton University Press. Cited by: §2.1.
  • [34] L. M. Zintgraf, T. S. Cohen, T. Adel, and M. Welling (2017) Visualizing deep neural network decisions: prediction difference analysis. In International Conference on Learning Representations, Cited by: §6.

Appendix A Implementation details

Table 1: Performance and stability, with respect to MSE, of supervised and unsupervised approaches to on-manifold explainability. (Columns: data set, supervised MSE, unsupervised MSE.)

For the unsupervised approach, we modelled the encoder q(z | x) as a diagonal normal distribution with mean and variance determined by a neural network:

q(z | x) = N(z; μ(x), diag σ²(x)).

We modelled the decoder as a product distribution:

p(x | z) = ∏_j p(x_j | z),

where the distribution type (e.g. normal, categorical) of each factor p(x_j | z) is chosen per data set and each distribution’s parameters are determined by a shared neural network. We modelled the masked encoder q(z | x_S) as a Gaussian mixture:

q(z | x_S) = Σ_{k=1}^K π_k(x_S) N(z; μ_k(x_S), diag σ_k²(x_S)).
To allow the masked encoder to accept coalitions S of variable size as input, we simply masked out-of-coalition features with a special value chosen never to appear in the data.
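As a concrete illustration of this masking scheme (a minimal sketch: the sentinel value of -1 and the array layout are assumptions, since the paper's special value is not reproduced here):

```python
import numpy as np

MASK_VALUE = -1.0  # hypothetical sentinel; any value absent from the data works

def mask_features(x, coalition):
    """Replace out-of-coalition features with the sentinel, so a network can
    take coalitions of any size as fixed-length input.

    x         : 1-D array of feature values
    coalition : boolean mask, True for in-coalition features
    """
    return np.where(np.asarray(coalition, dtype=bool),
                    np.asarray(x, dtype=float),
                    MASK_VALUE)

# Features 0 and 2 are in the coalition; feature 1 is replaced by the sentinel.
masked = mask_features([0.3, 0.7, 0.1], [True, False, True])
assert np.allclose(masked, [0.3, -1.0, 0.1])
```

Because every masked input has the same length as the original feature vector, the same network architecture serves all coalitions.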

The unsupervised method has several hyperparameters: the coefficient multiplying the regularisation term in Eq. (11), the number of mixture components in Eq. (17), and the architecture and optimisation settings of the networks involved. For each experiment in this paper, we tuned these hyperparameters to minimise the MSE of Eq. (13) on a held-out validation set; see App. B for numerical details.

For the supervised approach, we modelled the value function directly using a neural network, again masking out-of-coalition features with the special value to accommodate coalitions S of variable size. This method’s hyperparameters, relating to architecture and optimisation, were similarly tuned to minimise validation-set MSE; see App. B for details.
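A rough sketch of this supervised setup (not the paper's exact implementation: the sentinel of -1, the uniform-at-random coalition distribution, and sklearn's `MLPRegressor` stand in for details specified in App. B):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

MASK_VALUE = -1.0  # hypothetical sentinel assumed never to appear in the data

def fit_value_function(f, X, n_coalitions=8, rng=None):
    """Regress masked inputs onto f(x), so the network's prediction at a
    masked point approximates E[f(x) | x_S], the on-manifold value function.

    Coalitions are drawn uniformly at random here for simplicity.
    """
    rng = np.random.default_rng(rng)
    rows, targets = [], []
    for x in np.asarray(X, dtype=float):
        y = f(x)
        for _ in range(n_coalitions):
            coalition = rng.random(len(x)) < 0.5  # random feature subset
            rows.append(np.where(coalition, x, MASK_VALUE))
            targets.append(y)
    net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
    net.fit(np.array(rows), np.array(targets))
    return net
```

Minimising the MSE drives the network's output at each masked point toward the conditional expectation of f(x) given the unmasked features, which is why a plain regression loss suffices here.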

As discussed in Sec. 4.4, the supervised method is expected to achieve a smaller MSE than the unsupervised approach. We confirmed this on all data sets studied in this paper; see Table 1 for a numerical comparison. In the table, central values indicate the test-set MSE achieved by each method, and the uncertainties represent the standard deviation in test-set MSE upon re-training each method 10 times with fixed hyperparameters. The smaller uncertainties indicate that the supervised method also offers increased stability over the unsupervised approach.

Appendix B Details of experiments

Here we provide numerical details for all experiments presented in the paper.

B.1 Drug Consumption experiment

Several experiments were performed on the Drug Consumption data from the UCI repository [9]. We used 10 binary features from the data set – Mushrooms, Ecstasy, etc., as displayed in Fig. 2 – to predict whether individuals had ever consumed an 11th drug: LSD.

The Shapley values in Fig. 2(a) and Fig. 2(b) describe a single decision tree fit with default sklearn parameters as well as max_depth = 1 and max_features = None. While the data exhibits a class balance, the decision tree achieves 82.0% accuracy on a held-out test set.

Local off-manifold Shapley values in Fig. 2(a) were computed by Monte Carlo sampling permutations to estimate Eq. (1). For each sampled permutation, a random data point was drawn from the test set to estimate the off-manifold value function of Eq. (2). Bar heights in Fig. 2(a) are the means that resulted from Monte Carlo samples per feature. Throughout the paper, error bars represent standard errors of the means.
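The permutation-sampling estimator described above can be sketched as follows (a minimal illustration, assuming a model `f` that acts on 1-D feature arrays; sample counts and batching are simplified):

```python
import numpy as np

def shapley_mc(f, x, data, n_samples=1000, rng=None):
    """Monte Carlo estimate of off-manifold Shapley values for model f at x.

    For each sampled permutation, a random reference point supplies the
    out-of-coalition feature values; features are then switched to their
    values in x one at a time, crediting each feature with the change in f.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    phi = np.zeros(len(x))
    for _ in range(n_samples):
        perm = rng.permutation(len(x))
        z = data[rng.integers(len(data))].copy()  # random imputation point
        prev = f(z)
        for i in perm:  # grow the coalition in the sampled order
            z[i] = x[i]
            cur = f(z)
            phi[i] += cur - prev
            prev = cur
    return phi / n_samples
```

By construction each sample's attributions sum to f(x) minus f evaluated at the reference point, so the estimates satisfy the efficiency property on average.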

Local on-manifold Shapley values in Fig. 2(a) were again computed using Monte Carlo samples of Eq. (1), but this time using the on-manifold value function of Eq. (4). For each sampled coalition S, a random data point was drawn from the test set, with the crucial requirement that it agree with the explained point on the in-coalition features. In the text, we refer to this as empirically estimating the conditional distribution. Such empirical estimation is only possible because this data set has a small number of all-binary features.
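A minimal sketch of this empirical conditional sampling (function name and array layout are illustrative):

```python
import numpy as np

def sample_conditional(x, coalition, data, rng=None):
    """Draw from the empirical conditional distribution: a uniform sample
    over data points that agree with x on the in-coalition features.

    Feasible here only because the features are few and all binary, so
    exact matches are plentiful in the test set.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x)
    data = np.asarray(data)
    coalition = np.asarray(coalition, dtype=bool)
    matches = data[(data[:, coalition] == x[coalition]).all(axis=1)]
    if len(matches) == 0:
        raise ValueError("no data point matches the coalition's feature values")
    return matches[rng.integers(len(matches))]
```

With continuous or high-dimensional features, exact matches become vanishingly rare, which is precisely why the paper's generative and supervised methods are needed there.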

Global Shapley values in Fig. 2(b) were similarly computed using Monte Carlo samples of Eq. (5). For each labelled data point sampled from the test set, a single permutation was drawn to estimate Eq. (1) and a single data point was drawn to estimate the value function.

The Shapley values of Fig. 2(c) describe a random forest fit with default sklearn parameters and max_features = None, which achieves 82.2% test-set accuracy. Global off- and on-manifold Shapley values were computed just as in Fig. 2(b). Tree SHAP values were computed with the SHAP package [20] with model_output = margin and feature_perturbation = tree_path_dependent.

The values labelled “Model retraining” in Fig. 2(c) were computed by fitting a separate random forest for each coalition of the 10 features in the data set: 2^10 = 1024 models in all. We used these models to compute the sum of Eq. (8), where each term is a variant of the corresponding model’s accuracy: the accuracy achieved if one predicts labels by drawing stochastically from that model’s predicted probability distribution (as opposed to deterministically choosing the maximum-probability class).
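The stochastic variant of accuracy used here can be sketched as follows (function name hypothetical):

```python
import numpy as np

def stochastic_accuracy(proba, labels, rng=None):
    """Accuracy when each label is predicted by sampling from the model's
    predicted class distribution, rather than taking the arg-max class.

    proba  : (n, k) array of predicted class probabilities
    labels : (n,) integer class labels
    """
    rng = np.random.default_rng(rng)
    preds = np.array([rng.choice(len(p), p=p) for p in proba])
    return float((preds == np.asarray(labels)).mean())
```

In expectation this equals the average probability the model assigns to the true class, so it varies smoothly with the model's confidence rather than jumping when the arg-max class flips.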

The global on-manifold Shapley values in Fig. 2(c) appear in Fig. 4(a) as well, labelled “Empirical”. Fig. 4(a) also displays on-manifold Shapley values computed using the supervised and unsupervised methods introduced in this paper. As above, these are Monte Carlo estimates of Eq. (5). The supervised method involved training a fully-connected network on the MSE loss of Eq. (14). All neural networks in this paper used 2 flat hidden layers, Adam [13] for optimisation, and a batch size of 256. We scanned over a grid of hidden layer sizes and learning rates, Eq. (18), choosing the point with minimal MSE on a held-out validation set after 10k epochs of training; see Table 2. Each supervised value in Fig. 4(a) corresponds to Monte Carlo samples.

The unsupervised method involved training a variational autoencoder to minimise the loss of Sec. 4.1, as described in App. A. The encoder, decoder, and masked encoder were each modelled using fully-connected networks, trained with early stopping at patience 100. We scanned over a grid of hidden layer sizes and learning rates as in Eq. (18), as well as a grid of latent dimensions and numbers of latent modes, Eq. (19), choosing the point with minimal validation-set MSE; see Table 2. Unsupervised values in Fig. 4(a) correspond to Monte Carlo samples.

B.2 Abalone experiment

For the experiment of Sec. 5.1, we used the Abalone data set from the UCI repository [9]. The data contains 8 features corresponding to physical measurements (see Fig. 4b) which we used to classify abalone as younger than or older than the median age. We trained a neural network to perform this task – with hidden layer size 100, default sklearn parameters, and early stopping – obtaining a test-set accuracy of 78%.

Shapley values in Fig. 4(b) were computed exactly as described in Sec. B.1, except that the supervised method involved training for 5k epochs. Optimised hyperparameters are given in Table 2.

Data set Method Hidden dim. Learn. rate Latent dim. Modes
Drug supervised 512
unsupervised 128 4 1 0.5
Abalone supervised 512
unsupervised 256 2 1 0.05
Census supervised 512
unsupervised 128 8 1 1
Mnist supervised 512
unsupervised 512 16 1 1
Table 2: Optimal hyperparameters found for computing on-manifold Shapley values.

B.3 Census Income experiment

For the experiment of Sec. 5.2, we used the Census Income data set from the UCI repository [9]. The data contains 49k individuals from the 1994 US Census, as well as 13 features (see Fig. 4c) which we used to predict whether annual income exceeded $50k. We trained an xgboost classifier [7] with default parameters, achieving a test-set accuracy of 85% amidst a class balance.

Shapley values in Fig. 4(c) were computed exactly as described in Sec. B.1, except that the supervised method used 5k epochs, and the unsupervised method used patience 50. Optimised hyperparameters are given in Table 2. The on-manifold values in Fig. 4(c) were computed using the supervised method. While the unsupervised method does not appear in the figure, it was performed to complete Table 1.

B.4 MNIST experiment

In Sec. 5.3, we used binary MNIST [16]. We trained a fully-connected network – with hidden layer size 512, default parameters, and early stopping – achieving 98% test-set accuracy.

The digits in Fig. 5(a) were randomly drawn from the test set. Shapley values in Fig. 5(a) were computed exactly as described in Sec. B.1, except that the supervised method involved training for 2k epochs, and the on-manifold explanations are based on 16k Monte Carlo samples per pixel. Optimised hyperparameters are given in Table 2. The on-manifold explanations in Fig. 5(a) were computed using the supervised method. While the unsupervised method does not appear in the figure, it was performed to complete Table 1.

The average uncertainty, which is not shown in Fig. 5(a), is roughly 0.002 – stated as a fraction of the maximum Shapley value in each image.