Feature Removal Is a Unifying Principle for Model Explanation Methods

by Ian Covert, et al.
University of Washington

Researchers have proposed a wide variety of model explanation approaches, but it remains unclear how most methods are related or when one method is preferable to another. We examine the literature and find that many methods are based on a shared principle of explaining by removing - essentially, measuring the impact of removing sets of features from a model. These methods vary in several respects, so we develop a framework for removal-based explanations that characterizes each method along three dimensions: 1) how the method removes features, 2) what model behavior the method explains, and 3) how the method summarizes each feature's influence. Our framework unifies 25 existing methods, including several of the most widely used approaches (SHAP, LIME, Meaningful Perturbations, permutation tests). Exposing the fundamental similarities between these methods empowers users to reason about which tools to use and suggests promising directions for ongoing research in model explainability.



1 Introduction

The proliferation of black-box models has made machine learning (ML) explainability an increasingly important subject, and researchers have now proposed a wide variety of model explanation approaches (Breiman, 2001; Chen et al., 2018b; Covert et al., 2020; Lundberg and Lee, 2017; Owen, 2014; Petsiuk et al., 2018; Ribeiro et al., 2016; Sundararajan et al., 2017; Štrumbelj et al., 2009; Zeiler and Fergus, 2014). Despite progress in the field, the relationships and trade-offs among these methods have not been rigorously investigated, and researchers have not always formalized their fundamental ideas about how to interpret models (Lipton, 2018). This makes the literature difficult to navigate and raises questions about whether existing methods relate to human processes for explaining complex decisions (Miller et al., 2017; Miller, 2019).

Here, we present a comprehensive framework that unifies a substantial portion of the model explanation literature. Our framework is based on the observation that many methods can be understood as simulating feature removal to quantify each feature’s influence on a model. The intuition behind these methods is similar (depicted in Figure 1), but each one takes a slightly different approach to the removal operation: some replace features with neutral values (Petsiuk et al., 2018; Zeiler and Fergus, 2014), others marginalize over a distribution of values (Lundberg and Lee, 2017; Strobl et al., 2008), and still others train separate models for each subset of features (Lipovetsky and Conklin, 2001; Štrumbelj et al., 2009). These methods also vary in other respects, as we describe below.

We refer to this class of approaches as removal-based explanations and identify 25 existing methods that rely on the feature removal principle (this total does not include minor variations on the approaches we identified), including several of the most widely used methods (SHAP, LIME, Meaningful Perturbations, permutation tests). We then develop a framework that shows how each method arises from various combinations of three choices: 1) how the method removes features from the model, 2) what model behavior the method analyzes, and 3) how the method summarizes each feature’s influence on the model. By characterizing each method in terms of three precise mathematical choices, we are able to systematize their shared elements and reveal that they all rely on the same fundamental approach: feature removal.


Figure 1: A unified framework for removal-based explanations. Each method is determined by three choices: how it removes features, what model behavior it analyzes, and how it summarizes feature influence.

The model explanation field has grown significantly in the past decade, and we take a broader view of the literature than existing unification theories. Our framework’s flexibility enables us to establish links between disparate classes of methods (e.g., computer vision-focused methods, global methods, game-theoretic methods, feature selection methods) and show that the literature is more interconnected than previously recognized. Exposing these underlying connections potentially raises questions about the degree of novelty in recent work, but we also believe that each method has the potential to offer unique advantages, either computationally or conceptually.

Through this work, we hope to empower users to reason more carefully about which tools to use, and we aim to provide researchers with new theoretical tools to build on in ongoing research. Our contributions include:

  1. We present a framework that unifies 25 existing explanation methods. Our framework for removal-based explanations integrates classes of methods that were previously considered disjoint, including local and global approaches as well as feature attribution and feature selection methods.

  2. We develop new mathematical tools to represent different approaches to removing features from ML models. Subset functions and model extensions provide a common representation for various feature removal techniques, revealing that this choice is interchangeable between methods.

  3. We generalize numerous explanation methods to express them within our framework, exposing connections that were not apparent in the original works. In particular, for several approaches we disentangle the substance of the methods from the approximations that make them practical.

We begin with background on the model explanation problem and a review of prior work (Section 2), and we then introduce our framework (Section 3). The next several sections examine our framework in detail by showing how it encompasses existing methods. Section 4 discusses how methods remove features, Section 5 formalizes the model behaviors analyzed by each method, and Section 6 describes each method’s approach to summarizing feature influence. Finally, Section 7 concludes and discusses future research directions.

2 Background

Here, we introduce the model explanation problem and briefly review existing approaches and related unification theories.

2.1 Preliminaries

Consider a supervised ML model $f$ that is used to predict a response variable $Y$ using the input $X = (X_1, X_2, \ldots, X_d)$, where each $X_i$ represents an individual feature, such as a patient’s age. We use uppercase symbols (e.g., $X$) to denote random variables and lowercase ones (e.g., $x$) to denote their values. We also use $\mathcal{X}$ to denote the domain of the full feature vector $X$ and $\mathcal{X}_i$ to denote the domain of each feature $X_i$. Finally, $x_S \equiv \{x_i : i \in S\}$ denotes a subset of features for $S \subseteq D \equiv \{1, 2, \ldots, d\}$, and $\bar{S} \equiv D \setminus S$ represents a set’s complement.

ML interpretability broadly aims to provide insight into how models make predictions. This is particularly important when $f$ is a complex model, such as a neural network or a decision forest. The most active area of research in the field is local interpretability, which explains individual predictions, such as an individual patient diagnosis (Lundberg and Lee, 2017; Ribeiro et al., 2016; Sundararajan et al., 2017); in contrast, global interpretability explains the model’s behavior across the entire dataset (Breiman, 2001; Covert et al., 2020; Owen, 2014). Both problems are usually addressed using feature attribution, where a score is assigned to explain each feature’s influence. However, recent work has also proposed the strategy of local feature selection (Chen et al., 2018b), and other papers have introduced methods to isolate sets of relevant features (Dabkowski and Gal, 2017; Fong and Vedaldi, 2017; Zhou et al., 2014).

Whether the aim is local or global interpretability, explaining the inner workings of complex models is fundamentally difficult, so it is no surprise that researchers keep devising new approaches. Commonly cited categories of approaches include perturbation-based methods (Lundberg and Lee, 2017; Zeiler and Fergus, 2014), gradient-based methods (Simonyan et al., 2013; Sundararajan et al., 2017), and inherently interpretable models (Rudin, 2019; Zhou et al., 2016). However, these categories refer to loose collections of approaches that seldom share a precise mechanism.

Besides the inherently interpretable models, virtually all of these approaches generate explanations by considering some class of perturbation to the input and using the outcomes to explain each feature’s influence. Certain methods consider infinitesimal perturbations by calculating gradients (Simonyan et al., 2013; Smilkov et al., 2017; Sundararajan et al., 2017; Xu et al., 2020), but there are many possible perturbations (Fong and Vedaldi, 2017; Lundberg and Lee, 2017; Ribeiro et al., 2016; Zeiler and Fergus, 2014). Our work is based on the observation that numerous perturbation strategies can be understood as simulating feature removal.

2.2 Related work

Prior work has made solid progress in exposing connections among disparate explanation methods. Lundberg and Lee proposed the unifying framework of additive feature attribution methods and showed that LIME, DeepLIFT, LRP and QII are all related to SHAP (Bach et al., 2015; Datta et al., 2016; Lundberg and Lee, 2017; Ribeiro et al., 2016; Shrikumar et al., 2016). Similarly, Ancona et al. showed that Grad * Input, DeepLIFT, LRP and Integrated Gradients are all understandable as modified gradient backpropagations (Ancona et al., 2017; Shrikumar et al., 2016; Sundararajan et al., 2017). Most recently, Covert et al. showed that several global explanation methods can be viewed as additive importance measures, including permutation tests, Shapley Net Effects, and SAGE (Breiman, 2001; Covert et al., 2020; Lipovetsky and Conklin, 2001).

Relative to prior work, the unification we propose is considerably broader but nonetheless precise. As we describe below, our framework characterizes methods along three dimensions. The choice of how to remove features has been considered by many works (Aas et al., 2019; Frye et al., 2020; Hooker and Mentch, 2019; Janzing et al., 2019; Lundberg and Lee, 2017; Merrick and Taly, 2019; Sundararajan and Najmi, 2019; Chang et al., 2018; Agarwal and Nguyen, 2019). The choice of what model behavior to analyze has been considered explicitly by only a few works (Covert et al., 2020; Lundberg et al., 2020), as has the choice of how to summarize each feature’s influence based on a set function (Covert et al., 2020; Datta et al., 2016; Frye et al., 2019; Lundberg and Lee, 2017; Štrumbelj et al., 2009). To our knowledge, ours is the first work to consider all three dimensions simultaneously and unite them under a single framework.

Besides the methods that we focus on, there are also methods that do not rely on the feature removal principle. We direct readers to survey articles for a broader overview of the literature (Adadi and Berrada, 2018; Guidotti et al., 2018).

3 Removal-Based Explanations

We now introduce our framework and briefly describe the methods it unifies.

| Method | Removal | Behavior | Summary |
|---|---|---|---|
| IME (2009) | Separate models | Prediction | Shapley value |
| IME (2010) | Marginalize (uniform) | Prediction | Shapley value |
| QII | Marginalize (marginals product) | Prediction | Shapley value |
| SHAP | Marginalize (conditional/marginal) | Prediction | Shapley value |
| KernelSHAP | Marginalize (marginal) | Prediction | Shapley value |
| TreeSHAP | Tree distribution | Prediction | Shapley value |
| LossSHAP | Marginalize (conditional) | Prediction loss | Shapley value |
| SAGE | Marginalize (conditional) | Dataset loss (label) | Shapley value |
| Shapley Net Effects | Separate models | Dataset loss (label) | Shapley value |
| Shapley Effects | Marginalize (conditional) | Dataset loss (output) | Shapley value |
| Permutation Test | Marginalize (marginal) | Dataset loss (label) | Remove individual |
| Conditional Perm. Test | Marginalize (conditional) | Dataset loss (label) | Remove individual |
| Feature Ablation (LOCO) | Separate models | Dataset loss (label) | Remove individual |
| Univariate Predictors | Separate models | Dataset loss (label) | Include individual |
| L2X | Missingness during training | Prediction mean loss | High-value subset |
| INVASE | Missingness during training | Prediction mean loss | High-value subset |
| LIME (Images) | Default values | Prediction | Linear model |
| LIME (Tabular) | Marginalize (replacement dist.) | Prediction | Linear model |
| PredDiff | Marginalize (conditional) | Prediction | Remove individual |
| Occlusion | Zeros | Prediction | Remove individual |
| CXPlain | Zeros | Prediction loss | Remove individual |
| RISE | Zeros | Prediction | Mean when included |
| MM | Default values | Prediction | Partitioned subsets |
| MIR | Extend pixel values | Prediction | High-value subset |
| MP | Blurring | Prediction | Low-value subset |
| EP | Blurring | Prediction | High-value subset |
| FIDO-CA | Generative model | Prediction | High-value subset |

Table 1: Choices made by existing removal-based explanations.

3.1 A unified framework

We develop a unified model explanation framework by connecting methods that define a feature’s influence through the impact of removing it from a model. This perspective encompasses a substantial portion of the explainability literature: we find that 25 existing methods rely on this mechanism, including many of the most widely used approaches (Breiman, 2001; Fong and Vedaldi, 2017; Lundberg and Lee, 2017; Ribeiro et al., 2016).

These methods all remove groups of features from the model, but, beyond that, they take a diverse set of approaches. For example, LIME fits a linear model to an interpretable representation of the input (Ribeiro et al., 2016), L2X selects the most informative features for a single example (Chen et al., 2018b), and Shapley Effects examines how much of the model’s variance is explained by each feature (Owen, 2014). Perhaps surprisingly, their differences are easy to systematize because each method removes discrete sets of features.

As our main contribution, we introduce a framework that shows how these methods can be specified using only three choices.

Definition 1.

Removal-based explanations are model explanations that quantify the impact of removing sets of features from the model. These methods are determined by three choices:

  1. (Feature removal) How the method removes features from the model (e.g., by setting them to default values or by marginalizing over a distribution of values)

  2. (Model behavior) What model behavior the method analyzes (e.g., the probability of the true class or the model loss)

  3. (Summary technique) How the method summarizes each feature’s impact on the model (e.g., by removing a feature individually or by calculating the Shapley values)

This precise yet flexible framework represents each choice as a specific type of mathematical function, as we show later. The framework unifies disparate explanation methods, and, by unraveling each method’s choices, offers a step towards a better understanding of the literature by allowing explicit reasoning about the trade-offs among different approaches.
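The three choices in Definition 1 can be read as three interchangeable functions. The following is a minimal sketch under stated assumptions, not an implementation of any particular published method: removal marginalizes over a background sample, the behavior is the prediction itself, and the summary is either the exact Shapley value or the effect of removing each feature individually. All function names, the toy linear model, and the data are hypothetical.

```python
import itertools
import math
import numpy as np

def make_marginal_extension(model, background):
    """Choice 1 (feature removal): average over background rows for removed features."""
    def F(x, S):
        samples = background.copy()
        idx = np.array(sorted(S), dtype=int)
        samples[:, idx] = x[idx]               # features in S stay fixed at x
        return float(model(samples).mean())    # removed features are averaged out
    return F

def prediction_behavior(F, x):
    """Choice 2 (model behavior): the prediction for x as a set function of S."""
    return lambda S: F(x, S)

def remove_individual(u, d):
    """Choice 3 (summary): value lost when each feature is removed on its own."""
    full = frozenset(range(d))
    return np.array([u(full) - u(full - {i}) for i in range(d)])

def shapley_values(u, d):
    """Choice 3 (summary): exact Shapley values by enumerating subsets (small d only)."""
    phi = np.zeros(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for k in range(d):
            weight = math.factorial(k) * math.factorial(d - k - 1) / math.factorial(d)
            for T in itertools.combinations(others, k):
                T = frozenset(T)
                phi[i] += weight * (u(T | {i}) - u(T))
    return phi

# Hypothetical toy setup, for illustration only.
rng = np.random.default_rng(0)
background = rng.normal(size=(200, 3))
coef = np.array([2.0, -1.0, 0.0])
model = lambda X: X @ coef
x = np.array([1.0, 1.0, 1.0])

F = make_marginal_extension(model, background)
u = prediction_behavior(F, x)
phi_shapley = shapley_values(u, d=3)
phi_remove = remove_individual(u, d=3)
```

Swapping any one of the three functions while keeping the other two yields a different explanation method in the same space. For this additive toy model the two summaries coincide, which would not hold in general for models with feature interactions.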


Figure 2: Visual depiction of the space of removal-based explanations.

3.2 Overview of existing approaches

We now outline some of our findings, which we present in more detail in the next several sections. In particular, we preview how existing methods fit into our framework and highlight groups of methods that appear similar in light of our feature removal perspective.

Table 1 lists the methods unified by our framework (with acronyms introduced in the next section). These methods represent diverse parts of the interpretability literature, including global methods (Breiman, 2001; Owen, 2014), computer vision-focused methods (Petsiuk et al., 2018; Zeiler and Fergus, 2014; Zhou et al., 2014; Fong and Vedaldi, 2017), game-theoretic methods (Covert et al., 2020; Lundberg and Lee, 2017; Štrumbelj and Kononenko, 2010) and feature selection methods (Chen et al., 2018b; Fong et al., 2019; Yoon et al., 2018). They are all unified by their reliance on feature removal.

Disentangling the details of each method shows that many approaches share one or more of the same choices. For example, most methods choose to explain individual predictions (the model behavior), and the most popular summary technique is the Shapley value (Shapley, 1953). These common choices raise important questions about how different these methods truly are and how their choices are justified.

To highlight similarities among the methods, we visually depict the space of removal-based explanations in Figure 2. Visualizing our framework reveals several regions in the space of methods that are crowded (e.g., methods that marginalize out removed features with their conditional distribution and that calculate Shapley values), while certain methods are relatively unique and spatially isolated (e.g., RISE; LIME for tabular data; L2X and INVASE). Empty positions in the grid reveal opportunities to develop new methods; in fact, every empty position represents a viable new explanation method.

| Removal | Behavior | Summary | Methods |
|---|---|---|---|
| ✓ |   | ✓ | SHAP, LossSHAP, SAGE, Shapley Effects |
| ✓ | ✓ |   | Occlusion, LIME (images), MM, RISE |
|   | ✓ | ✓ | Feature ablation (LOCO), permutation tests, conditional permutation tests |
| ✓ | ✓ |   | Univariate predictors, feature ablation (LOCO), Shapley Net Effects |
|   | ✓ | ✓ | SAGE, Shapley Net Effects |
| ✓ | ✓ |   | SAGE, conditional permutation tests |
| ✓ |   | ✓ | Shapley Net Effects, IME (2009) |
| ✓ |   | ✓ | Occlusion, CXPlain |
|   | ✓ | ✓ | Occlusion, PredDiff |
| ✓ |   | ✓ | Conditional permutation tests, PredDiff |
| ✓ | ✓ |   | SHAP, PredDiff |

Table 2: Common combinations of choices in existing methods. Check marks (✓) indicate choices that are identical between methods.

Finally, Table 2 shows groups of methods that differ in only one dimension of the framework. These methods are neighbors in the space of explanation methods (Figure 2), and it is remarkable how many instances of neighboring methods exist in the literature. Certain methods even have neighbors along every dimension of the framework (e.g., SHAP, SAGE, Occlusion, PredDiff, conditional permutation tests), reflecting how intimately connected the literature has become. The explainability literature is evolving and maturing, and our perspective provides a new approach for reasoning about the subtle relationships and trade-offs among existing approaches.
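The notion of neighboring methods can be checked mechanically. As a small sketch, the snippet below copies a few rows of Table 1 into a dictionary and lists every pair that differs in exactly one dimension. The exact-string comparison is a simplifying assumption: it would not, for instance, treat zeros as a special case of default values.

```python
from itertools import combinations

DIMENSIONS = ("removal", "behavior", "summary")

# A subset of Table 1: method -> (removal, behavior, summary).
METHODS = {
    "LossSHAP": ("Marginalize (conditional)", "Prediction loss", "Shapley value"),
    "SAGE": ("Marginalize (conditional)", "Dataset loss (label)", "Shapley value"),
    "Shapley Net Effects": ("Separate models", "Dataset loss (label)", "Shapley value"),
    "Conditional Perm. Test": ("Marginalize (conditional)", "Dataset loss (label)", "Remove individual"),
    "Feature Ablation (LOCO)": ("Separate models", "Dataset loss (label)", "Remove individual"),
    "PredDiff": ("Marginalize (conditional)", "Prediction", "Remove individual"),
    "Occlusion": ("Zeros", "Prediction", "Remove individual"),
    "CXPlain": ("Zeros", "Prediction loss", "Remove individual"),
}

def find_neighbors(methods):
    """Return (method_a, method_b, dimension) for pairs differing in exactly one choice."""
    result = []
    for (a, choices_a), (b, choices_b) in combinations(methods.items(), 2):
        differing = [dim for dim, ca, cb in zip(DIMENSIONS, choices_a, choices_b) if ca != cb]
        if len(differing) == 1:
            result.append((a, b, differing[0]))
    return result

neighbors = find_neighbors(METHODS)
for a, b, dim in neighbors:
    print(f"{a} and {b} differ only in {dim}")
```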

4 Feature Removal

Here, we define the mathematical tools necessary to remove features from ML models and then examine how existing explanation methods remove features.

4.1 Functions on subsets of features

Most ML models make predictions given a specific set of features $X = (X_1, \ldots, X_d)$. Mathematically, these models are functions of the form $f: \mathcal{X} \mapsto \mathcal{Y}$, and we use $\mathcal{F}$ to denote the set of all such possible mappings. The principle behind removal-based explanations is to remove certain features to understand their impact on a model, but since most models require all the features to make predictions, removing a feature is more complicated than simply not giving the model access to it.

To remove features from a model, or to make predictions given a subset of features, we require a different mathematical object than $f \in \mathcal{F}$. Instead of functions with domain $\mathcal{X}$, we consider functions with domain $\mathcal{X} \times \mathcal{P}(D)$, where $\mathcal{P}(D)$ denotes the power set of $D \equiv \{1, \ldots, d\}$. To ensure invariance to the held out features, these functions must depend only on a set of features specified by a subset $S \in \mathcal{P}(D)$, so we formalize subset functions as follows.

Definition 2.

A subset function is a mapping of the form

$$F: \mathcal{X} \times \mathcal{P}(D) \mapsto \mathcal{Y}$$

that is invariant to the dimensions that are not in the specified subset. That is, we have $F(x, S) = F(x', S)$ for all $(x, x', S)$ such that $x_S = x'_S$. We define $F(x_S) \equiv F(x, S)$ for convenience because the held out values $x_{\bar{S}}$ are not used by $F$.

A subset function’s invariance property is crucial to ensure that only the specified feature values determine the function’s output, while guaranteeing that the other feature values do not matter. Another way of viewing subset functions is that they simulate the presence of missing data. While we use $\mathcal{F}$ to represent standard prediction functions, we use $\mathfrak{F}$ to denote the set of all possible subset functions.

We introduce subset functions here because they help conceptualize how different methods remove features from ML models. Removal-based explanations typically begin with an existing model $f \in \mathcal{F}$, and in order to quantify each feature’s influence, they must establish a convention for removing it from the model. A natural approach is to define a subset function $F \in \mathfrak{F}$ based on the original model $f$. To formalize this idea, we define a model extension as follows.

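Definition 3.

An extension of a model $f \in \mathcal{F}$ is a subset function $F \in \mathfrak{F}$ that agrees with $f$ in the presence of all features. That is, the model $f$ and its extension $F$ must satisfy

$$F(x) = f(x) \quad \forall x \in \mathcal{X}.$$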

As we show next, extending an existing model is the first step towards specifying a removal-based explanation method.
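The two requirements just stated (invariance to held out features, and agreement with the original model when all features are present) can be checked numerically. Below is a minimal sketch that builds a subset function by filling removed features with default values. The helper name, the toy model, and the inputs are hypothetical.

```python
import numpy as np

def default_value_extension(model, defaults):
    """Build F(x, S): features outside S are replaced by fixed default values."""
    def F(x, S):
        z = defaults.copy()
        idx = np.array(sorted(S), dtype=int)
        z[idx] = x[idx]            # only the features in S are read from x
        return model(z)
    return F

# Hypothetical toy model with three features.
model = lambda z: float(z[0] * z[1] + z[2])
defaults = np.zeros(3)
F = default_value_extension(model, defaults)

x = np.array([1.0, 2.0, 3.0])
x_prime = np.array([1.0, 2.0, -7.0])   # agrees with x on S = {0, 1}
S = {0, 1}
D = {0, 1, 2}

# Invariance: inputs that agree on S give the same output.
invariant = F(x, S) == F(x_prime, S)
# Extension: with all features present, F reproduces the original model.
agrees = F(x, D) == model(x)
print(invariant, agrees)
```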

4.2 Removing features from machine learning models

Existing methods have devised numerous ways to evaluate models while withholding groups of features. Although certain methods use different terminology to describe their approaches (e.g., deleting information, ignoring features, using neutral values, etc.), the goal of these methods is to measure a feature’s influence through the impact of removing it from the model. Most proposed techniques can be understood as extensions of an existing model (Definition 3).

We now examine each method’s approach (see Appendix A for more details):

  • (Zeros) Occlusion (Zeiler and Fergus, 2014), RISE (Petsiuk et al., 2018) and causal explanations (CXPlain) (Schwab and Karlen, 2019) remove features simply by setting them to zero:

    $$F(x_S) = f(x_S, 0)$$

  • (Default values) LIME for image data (Ribeiro et al., 2016) and the Masking Model method (MM) (Dabkowski and Gal, 2017) remove features by setting them to user-defined default values (e.g., gray pixels for images). Given default values $r \in \mathcal{X}$, these methods calculate

    $$F(x_S) = f(x_S, r_{\bar{S}})$$

    This is a generalization of the previous approach, and in some cases features may be given different default values (e.g., their mean).

  • (Missingness during training) Learning to Explain (L2X) (Chen et al., 2018b) and Instance-wise Variable Selection (INVASE) (Yoon et al., 2018) use a model that has missingness introduced at training time. Removed features are replaced with zeros so that the model makes the following approximation:

    $$F(x_S) = f(x_S, 0) \approx p(Y \mid X_S = x_S)$$

    This approach differs from Occlusion and RISE because the model is trained to recognize zeros as missing values rather than zero-valued features. A model trained with a loss function other than cross entropy loss would approximate a different quantity (e.g., the conditional expectation $\mathbb{E}[Y \mid X_S = x_S]$ for MSE loss).

  • (Extend pixel values) Minimal image representation (MIR) (Zhou et al., 2014) removes features in images by extending the values of neighboring pixels. This effect is achieved through a gradient-space manipulation.

  • (Blurring) Meaningful Perturbations (MP) (Fong and Vedaldi, 2017) and Extremal Perturbations (EP) (Fong et al., 2019) remove features from images by blurring them with a Gaussian kernel. This approach is not an extension of $f$ because the blurred image retains dependence on the removed features. Blurring fails to remove large, low frequency objects (e.g., mountains), but it provides an approximate way to remove information from images.

  • (Generative model) FIDO-CA (Chang et al., 2018) removes features by replacing them with a sample from a conditional generative model (e.g., Yu et al. (2018)). The held out features are drawn from a generative model represented by $p_G(X_{\bar{S}} \mid X_S)$, or $\tilde{x}_{\bar{S}} \sim p_G(X_{\bar{S}} \mid X_S = x_S)$, and predictions are made as follows:

    $$F(x_S) = f(x_S, \tilde{x}_{\bar{S}})$$

  • (Marginalize with conditional) SHAP (Lundberg and Lee, 2017), LossSHAP (Lundberg et al., 2020) and SAGE (Covert et al., 2020) present a strategy for removing features by marginalizing them out using their conditional distribution $p(X_{\bar{S}} \mid X_S = x_S)$:

    $$F(x_S) = \mathbb{E}\big[f(X) \mid X_S = x_S\big]$$

    This approach is computationally challenging in practice, but recent work tries to achieve close approximations (Aas et al., 2019; Frye et al., 2020). Shapley Effects (Owen, 2014) implicitly uses this convention to analyze function sensitivity, while conditional permutation tests (Strobl et al., 2008) and Prediction Difference Analysis (PredDiff) (Zintgraf et al., 2017) do so to remove individual features.

  • (Marginalize with marginal) KernelSHAP (a practical implementation of SHAP) (Lundberg and Lee, 2017) removes features by marginalizing them out using their joint marginal distribution $p(X_{\bar{S}})$:

    $$F(x_S) = \mathbb{E}\big[f(x_S, X_{\bar{S}})\big]$$

    This is the default behavior in SHAP’s implementation (https://github.com/slundberg/shap), and recent work discusses the benefits of this approach (Janzing et al., 2019). Permutation tests (Breiman, 2001) use this approach to remove individual features from a model.

  • (Marginalize with product of marginals) Quantitative Input Influence (QII) (Datta et al., 2016) removes held out features by marginalizing them out using the product of the marginal distributions $p(X_i)$:

    $$F(x_S) = \mathbb{E}_{\prod_{i \in \bar{S}} p(X_i)}\big[f(x_S, X_{\bar{S}})\big]$$

  • (Marginalize with uniform) The updated version of the Interactions Method for Explanation (IME) (Štrumbelj and Kononenko, 2010) removes features by marginalizing them out with a uniform distribution over the feature space. If we let $u_i(X_i)$ denote a uniform distribution over $\mathcal{X}_i$ (with extremal values defining the boundaries for continuous features), then features are removed as follows:

    $$F(x_S) = \mathbb{E}_{\prod_{i \in \bar{S}} u_i(X_i)}\big[f(x_S, X_{\bar{S}})\big]$$

  • (Marginalize with replacement distributions) LIME for tabular data replaces features with independent draws from replacement distributions (our term), each of which depends on the original feature values. When a feature $X_i$ with value $x_i$ is removed, discrete features are drawn from the distribution $p(X_i \mid X_i \neq x_i)$; when quantization is used for continuous features (LIME’s default behavior; see https://github.com/marcotcr/lime), continuous features are simulated by first generating a different quantile and then simulating from a truncated normal distribution within that bin. If we denote each feature’s replacement distribution given the original value $x_i$ as $q_{x_i}(X_i)$, then LIME for tabular data removes features as follows:

    $$F(x, S) = \mathbb{E}_{\prod_{i \in \bar{S}} q_{x_i}(X_i)}\big[f(x_S, X_{\bar{S}})\big]$$

    Although this function agrees with $f$ given all features, it is not an extension because it does not satisfy the invariance property for subset functions.

  • (Tree distribution) Dependent TreeSHAP (Lundberg et al., 2020) removes features using the distribution induced by the model, which roughly approximates the conditional distribution. When splits for removed features are encountered in the model’s trees, TreeSHAP averages predictions from the multiple paths in proportion to how often the dataset follows each path.

  • (Separate models) Shapley Net Effects (Lipovetsky and Conklin, 2001) and the original version of IME (Štrumbelj et al., 2009) are not based on a single model $f$ but rather on separate models trained for each subset, which we denote as $\{f_S : S \subseteq D\}$. The prediction for a subset of features is given by that subset’s model:

    $$F(x_S) = f_S(x_S)$$

    Although this approach is technically an extension of the model $f_D$ trained with all features, its predictions given subsets of features are not based on $f_D$. Similarly, feature ablation, also known as leave-one-covariate-out (LOCO) (Lei et al., 2018), trains models to remove individual features, and the univariate predictors approach (used mainly for feature selection) uses models trained with individual features (Guyon and Elisseeff, 2003).
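Three of the marginalization strategies listed above (joint marginal, product of marginals, uniform) can be sketched with an empirical dataset standing in for the underlying distributions. This is an illustrative approximation under stated assumptions rather than the implementation used by KernelSHAP, QII, or IME. The function names and the toy data are hypothetical, and the conditional distribution is omitted because it generally requires a separate estimation step.

```python
import numpy as np

def marginal_extension(model, data):
    """Joint marginal: removed features are taken jointly from rows of `data`."""
    def F(x, S):
        Z = data.copy()
        idx = np.array(sorted(S), dtype=int)
        Z[:, idx] = x[idx]                 # hold the features in S fixed at x
        return float(model(Z).mean())      # average over the removed features
    return F

def product_of_marginals_extension(model, data, rng):
    """Product of marginals: shuffle each column independently to break dependence."""
    shuffled = np.column_stack([rng.permutation(col) for col in data.T])
    return marginal_extension(model, shuffled)

def uniform_extension(model, data, rng, n_samples=1000):
    """Uniform: sample each feature uniformly between its observed extremes."""
    low, high = data.min(axis=0), data.max(axis=0)
    samples = rng.uniform(low, high, size=(n_samples, data.shape[1]))
    return marginal_extension(model, samples)

# Hypothetical toy data: two strongly correlated features and an interaction model.
rng = np.random.default_rng(0)
x0 = rng.normal(size=2000)
data = np.column_stack([x0, x0 + 0.1 * rng.normal(size=2000)])
model = lambda Z: Z[:, 0] * Z[:, 1]
x = np.array([1.0, 2.0])

F_joint = marginal_extension(model, data)
F_prod = product_of_marginals_extension(model, data, rng)
F_unif = uniform_extension(model, data, rng)

# With every feature removed, the joint marginal keeps the correlation
# between the features while the product of marginals discards it.
print(F_joint(x, set()), F_prod(x, set()), F_unif(x, set()))
```

All three functions return the model's own prediction when no feature is removed, so each one is an extension in the sense of Definition 3. They disagree only on what a prediction means once features are withheld.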

Most of these approaches are extensions of an existing model $f$, so our formalisms provide useful tools for understanding how removal-based explanations remove features from models. However, there are two exceptions: the blurring technique (MP and EP) and LIME’s approach with tabular data. Both provide functions of the form $F: \mathcal{X} \times \mathcal{P}(D) \mapsto \mathcal{Y}$ that agree with $f$ given all features but still exhibit dependence on removed features. Based on our mathematical characterization of subset functions and their invariance to held out features, we argue that these two approaches do not fully remove features from the model. We conclude that the first dimension of our framework amounts to choosing an extension $F$ of the model $f$.
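To make the invariance failure concrete, the sketch below uses a simplified replacement distribution that redraws a removed discrete feature from the observed values different from its original value. This is a hypothetical stand-in for a replacement distribution that depends on the original feature value, not LIME's actual sampling code. The toy data, model, and function names are assumptions.

```python
import itertools
import numpy as np

def replacement_distribution(column, original):
    """Empirical distribution of a discrete column, excluding the original value."""
    values, counts = np.unique(column[column != original], return_counts=True)
    return values, counts / counts.sum()

def replacement_removal(model, data):
    """F(x, S): exact expectation over independent replacement draws for removed features."""
    d = data.shape[1]
    def F(x, S):
        removed = [i for i in range(d) if i not in S]
        dists = [replacement_distribution(data[:, i], x[i]) for i in removed]
        total = 0.0
        for combo in itertools.product(*[range(len(v)) for v, _ in dists]):
            z = x.astype(float)            # astype returns a fresh copy
            prob = 1.0
            for i, (values, probs), k in zip(removed, dists, combo):
                z[i] = values[k]
                prob *= probs[k]
            total += prob * model(z)
        return total
    return F

# Hypothetical discrete dataset and additive toy model.
data = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 1.0], [0.0, 1.0]])
model = lambda z: float(z[0] + 10.0 * z[1])
F = replacement_removal(model, data)

x = np.array([0.0, 1.0])
x_prime = np.array([2.0, 1.0])    # agrees with x on S = {1}
S = {1}

print(F(x, S), F(x_prime, S))     # the outputs differ, so invariance fails
print(F(x, {0, 1}) == model(x))   # yet F agrees with the model given all features
```

Because the replacement distribution for feature 0 changes with its original value, two inputs that agree on the retained feature still produce different outputs. This is the sense in which such a function is not a subset function.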

5 Explaining Different Model Behaviors

Removal-based explanations all aim to demonstrate how a model works, but they can do so by analyzing a variety of model behaviors. We now consider the various choices of target quantities to observe as different features are withheld from the model.

The feature removal principle is flexible enough to explain virtually any function. For example, methods can explain a model’s prediction, a model’s loss function, a hidden layer in a neural network, or any node in a computation graph. In fact, removal-based explanations need not be restricted to the ML context: any function that accommodates missing inputs can be explained via feature removal by examining either its output or some function of its output as groups of inputs are removed. This perspective shows the broad potential applications for removal-based explanations.

However, since our focus is the ML context, we proceed by examining how existing methods work. Each method’s target quantity can be understood as a function of the model output, which is represented by a subset function $F(x_S)$. Many methods explain the model output or a simple function of the output, such as the log-odds ratio. Other methods take into account a measure of the model’s loss, for either an individual input or the entire dataset. Ultimately, as we show below, each method generates explanations based on a set function of the form

$$u: \mathcal{P}(D) \mapsto \mathbb{R},$$

which represents a value associated with each subset of features $S \subseteq D$. This set function represents the model behavior that a method is designed to explain.

We now examine the specific choices made by existing methods (see Appendix A for further details on each method). The various model behaviors that methods analyze, and their corresponding set functions, include:

  • (Prediction) Occlusion, RISE, PredDiff, MP, EP, MM, FIDO-CA, MIR, LIME, SHAP (including KernelSHAP and TreeSHAP), IME and QII all analyze a model’s prediction for an individual input $x \in \mathcal{X}$:

    $$u_x(S) = F(x_S)$$

    These methods quantify how holding out different features makes an individual prediction either higher or lower. For multi-class classification models, methods often use a single output that corresponds to the class of interest, and they can also apply a simple function to the model’s output (for example, using the log-odds ratio rather than classification probability).

  • (Prediction loss) LossSHAP and CXPlain take into account the true label $y$ for an input $x$ and calculate the prediction loss using a loss function $\ell$:

    $$v_{xy}(S) = -\ell\big(F(x_S), y\big)$$

    By incorporating label information, these methods quantify whether certain features make the prediction more or less correct. The minus sign is necessary to give the set function a higher value when more informative features are included.

  • (Prediction mean loss) L2X and INVASE consider the expected loss for a given input $x$ according to the label’s conditional distribution $p(Y \mid X = x)$:

    $$w_x(S) = -\mathbb{E}_{p(Y \mid X = x)}\big[\ell\big(F(x_S), Y\big)\big]$$

    By averaging the loss across the label’s distribution, these methods highlight features that correctly predict what could have occurred, on average.

  • (Dataset loss w.r.t. label) Shapley Net Effects, SAGE, feature ablation (LOCO), permutation tests and univariate predictors consider the expected loss across the entire dataset:

    $$v(S) = -\mathbb{E}_{XY}\big[\ell\big(F(X_S), Y\big)\big].$$

    These methods quantify how much the model’s performance degrades when different features are removed. This set function can also be viewed as the predictive power derived from sets of features (Covert et al., 2020). Recent work has proposed a SHAP value aggregation scheme that can be considered a special case of this approach (Frye et al., 2020).

  • (Dataset loss w.r.t. output) Shapley Effects considers the expected loss with respect to the full model output:

    $$w(S) = -\mathbb{E}_{X}\big[\ell\big(F(X_S), F(X)\big)\big].$$

    Though related to the previous approach (Covert et al., 2020), Shapley Effects focuses on each feature’s influence on the model output rather than on the model performance.

Each set function serves a distinct purpose in exposing a model’s dependence on different features. The first three approaches listed above analyze the model’s behavior for individual predictions (local explanations); the last two take into account the model’s behavior across the entire dataset (global explanations). Although their aims differ, these set functions are all in fact related. Each builds upon the previous ones by accounting for either the loss or data distribution. Writing $u_x$, $v_{xy}$, $v_x$, $v$ and $w$ for the five set functions in the order listed above, their relationships can be summarized as follows:

$$
\begin{aligned}
v_{xy}(S) &= -\ell\big(u_x(S), y\big) \\
v_x(S) &= \mathbb{E}_{p(Y \mid X = x)}\big[v_{xY}(S)\big] \\
v(S) &= \mathbb{E}_{XY}\big[v_{XY}(S)\big] \\
w(S) &= -\mathbb{E}_{X}\big[\ell\big(u_X(S), u_X(D)\big)\big]
\end{aligned}
$$

These relationships show that explanations based on one set function are in some cases related to explanations based on another. For example, Covert et al. (2020) showed that SAGE explanations are the expectation of explanations provided by LossSHAP, a relationship reflected in Eq. 18.
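As a minimal sketch of how these set functions build on one another, the following Python snippet derives each of them from a single subset function. The linear model, the three-point dataset and the mean-imputation rule for held-out features are hypothetical choices made only for illustration; they are not taken from any of the methods above.

```python
from statistics import mean

# Hypothetical toy setup (assumption, not from the paper): a linear model on
# three features, and a subset function F that replaces held-out features
# with their dataset mean.
WEIGHTS = [2.0, -1.0, 0.5]
DATA = [([1.0, 0.0, 2.0], 3.0), ([0.0, 1.0, 0.0], -1.0), ([2.0, 2.0, 2.0], 3.5)]
MEANS = [mean(x[i] for x, _ in DATA) for i in range(3)]
D = frozenset(range(3))


def f(x):
    """Original model: a fixed linear function."""
    return sum(w_i * x_i for w_i, x_i in zip(WEIGHTS, x))


def F(x, S):
    """Subset function: predict using only the features in S."""
    return f([x[i] if i in S else MEANS[i] for i in range(len(x))])


def sq_loss(pred, target):
    return (pred - target) ** 2


def u_x(x, S):
    """Prediction for an individual input."""
    return F(x, S)


def v_xy(x, y, S):
    """Negated prediction loss for an individual input and label."""
    return -sq_loss(u_x(x, S), y)


def v(S):
    """Negated dataset loss w.r.t. the label: the average of v_xy."""
    return mean(v_xy(x, y, S) for x, y in DATA)


def w(S):
    """Negated dataset loss w.r.t. the full model output."""
    return mean(-sq_loss(u_x(x, S), u_x(x, D)) for x, _ in DATA)
```

In this sketch the global set function `v` is literally the dataset average of the local set function `v_xy`, which mirrors the local-to-global relationship described above.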

Understanding these connections is possible only because our framework disentangles each method’s choices rather than viewing each method as a monolithic algorithm. We conclude by reiterating that removal-based explanations can explain virtually any function, and that choosing what to explain amounts to selecting a set function to represent the model’s dependence on different sets of features.

6 Summarizing Feature Influence

The third choice for removal-based explanations is how to summarize each feature’s influence on the model. We examine the various summarization techniques and then discuss their computational complexity and approximation approaches.

6.1 Explaining set functions

The set functions we used to represent a model’s dependence on different features (Section 5) are complicated mathematical objects that are difficult to communicate fully due to the exponential number of feature subsets and underlying feature interactions. Removal-based explanations confront this challenge by providing users with a concise summary of each feature’s influence.

We distinguish between two main types of summarization approaches: feature attributions and feature selections. Many methods provide explanations in the form of feature attributions, which are numerical scores $a_i \in \mathbb{R}$ given to each feature $i = 1, \ldots, d$. If we use $\mathcal{U}$ to denote the set of all functions $u : \mathcal{P}(D) \to \mathbb{R}$, then we can represent feature attributions as mappings of the form $E : \mathcal{U} \to \mathbb{R}^d$, which we refer to as explanation mappings. Other methods take the alternative approach of summarizing set functions with a set $S^* \subseteq D$ of the most influential features. We represent these feature selection summaries as explanation mappings of the form $E : \mathcal{U} \to \mathcal{P}(D)$. Both approaches provide users with simple summaries of a feature’s contribution to the set function.

We now consider the specific choices made by each method (see Appendix A for further details). For simplicity, we let $u$ denote the set function each method analyzes. Surveying the various removal-based explanation methods, the techniques for summarizing each feature’s influence include:

  • (Remove individual) Occlusion, PredDiff, CXPlain, permutation tests and feature ablation (LOCO) calculate the impact of removing a single feature from the set of all features, resulting in the following attribution values:

    $$a_i = u(D) - u(D \setminus \{i\}).$$

    Occlusion, PredDiff and CXPlain can also be applied with groups of features in image contexts.

  • (Include individual) The univariate predictors approach calculates the impact of including individual features, resulting in the following attribution values:

    $$a_i = u(\{i\}) - u(\emptyset).$$

    This is essentially the reverse of the previous approach: while that approach removes individual features from the complete set, this one adds individual features to the empty set.

  • (Linear model) LIME fits a regularized weighted linear model to a dataset of perturbed examples. In the limit of an infinitely large dataset, this process approximates the following attribution values:

    $$a_1, \ldots, a_d = \arg\min_{b_0, \ldots, b_d} \; \sum_{S \subseteq D} \pi(S) \Big(b_0 + \sum_{i \in S} b_i - u(S)\Big)^2 + \Omega(b_1, \ldots, b_d).$$

    In this problem, $\pi$ represents a weighting kernel and $\Omega$ is a regularization function that is often set to the $\ell_1$ penalty to encourage sparse attributions (Tibshirani, 1996). Since this summary is based on an additive model, the learned coefficients represent values associated with including each feature.

  • (Mean when included) RISE determines feature attributions by sampling many subsets and then calculating the mean value when a feature is included. Denoting the distribution of subsets as $p(S)$ and the conditional distribution as $p(S \mid i \in S)$, the attribution values are defined as

    $$a_i = \mathbb{E}_{p(S \mid i \in S)}\big[u(S)\big].$$

    In practice, RISE samples the subsets $S \subseteq D$ by removing each feature independently with probability $p$, using $p = 0.5$ in the original experiments (Petsiuk et al., 2018).

  • (Shapley value) Shapley Net Effects, IME, Shapley Effects, QII, SHAP (including KernelSHAP, TreeSHAP and LossSHAP) and SAGE all calculate feature attribution values using the Shapley value, which we denote as $a_i = \phi_i(u)$. Shapley values are the only attributions that satisfy a number of desirable properties (Shapley, 1953).

  • (Low-value subset) MP selects a small set of features $S^*$ that can be removed to give the set function a low value. It does so by solving the following optimization problem:

    $$S^* = \arg\min_{S} \; u(D \setminus S) + \lambda |S|, \tag{24}$$

    where $\lambda > 0$ is a hyperparameter. In practice, MP uses additional regularizers and solves a relaxed version of this problem (see Section 6.2).

  • (High-value subset) MIR solves an optimization problem to select a small set of features $S^*$ that alone can give the set function a high value. For a user-defined minimum value $t$, the problem is given by:

    $$S^* = \arg\min_{S} \; |S| \quad \text{s.t.} \quad u(S) \geq t. \tag{25}$$

    L2X and EP solve a similar problem but switch the terms in the constraint and optimization objective. For a user-defined subset size $k$, the optimization problem is given by:

    $$S^* = \arg\max_{S} \; u(S) \quad \text{s.t.} \quad |S| = k. \tag{26}$$

    Finally, INVASE and FIDO-CA solve a regularized version of the problem with a parameter $\lambda > 0$ controlling the trade-off between the subset value and subset size:

    $$S^* = \arg\max_{S} \; u(S) - \lambda |S|. \tag{27}$$

  • (Partitioned subsets) MM solves an optimization problem to partition the features into $S$ and $D \setminus S$ while maximizing the difference in the set function’s values. This approach is based on the idea that removing features to find a low-value subset (as in MP) and retaining features to get a high-value subset (as in MIR, L2X, EP, INVASE and FIDO-CA) are both reasonable approaches for identifying influential features. For hyperparameters $\gamma, \lambda > 0$, the problem is given by:

    $$S^* = \arg\max_{S} \; u(S) - \gamma \, u(D \setminus S) - \lambda |S|. \tag{28}$$

    In practice, MM incorporates regularizers and monotonic link functions to enable a more flexible trade-off between $u(S)$ and $u(D \setminus S)$ (see Appendix A).
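To make the attribution-style summaries concrete, here is a minimal Python sketch that computes four of them by brute-force enumeration of subsets. The set function `toy_u` is a made-up three-feature game with one interaction term, and `mean_when_included` assumes a uniform distribution over subsets (each feature kept independently with probability 0.5); neither is taken from a specific method's implementation.

```python
from itertools import combinations
from math import factorial


def subsets(items):
    """Yield every subset of `items` as a frozenset, smallest first."""
    items = sorted(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def remove_individual(u, D):
    """Impact of removing each feature from the full set."""
    return {i: u(D) - u(D - {i}) for i in D}


def include_individual(u, D):
    """Impact of adding each feature to the empty set."""
    return {i: u(frozenset({i})) - u(frozenset()) for i in D}


def mean_when_included(u, D):
    """Mean value of u over subsets containing i (uniform over subsets)."""
    result = {}
    for i in D:
        values = [u(S) for S in subsets(D) if i in S]
        result[i] = sum(values) / len(values)
    return result


def shapley(u, D):
    """Exact Shapley values; enumerates all subsets, so only for small d."""
    d = len(D)
    phi = {}
    for i in D:
        total = 0.0
        for S in subsets(D - {i}):
            weight = factorial(len(S)) * factorial(d - len(S) - 1) / factorial(d)
            total += weight * (u(S | {i}) - u(S))
        phi[i] = total
    return phi


def toy_u(S):
    """Hypothetical set function: two additive terms plus a 0-2 interaction."""
    return 1.0 * (0 in S) + 2.0 * (1 in S) + 4.0 * (0 in S and 2 in S)


D = frozenset({0, 1, 2})
```

On this toy game the summaries disagree about feature 2: including it alone contributes nothing, removing it from the full set costs the entire interaction term, and the Shapley value splits the interaction evenly between features 0 and 2.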

As this discussion shows, every removal-based explanation generates summaries of each feature’s influence on the underlying set function. In general, a model’s dependencies are too complex to communicate fully, so explanations must provide users with a concise summary instead. As noted, most methods we discuss generate feature attributions, but several others generate explanations by selecting the most important features. These feature selection explanations are essentially coarse attributions that assign binary importance rather than a real number.

Interestingly, if the high-value subset optimization problems solved by MIR, L2X, EP, INVASE and FIDO-CA were applied to the set function that represents the dataset loss (Eq. 18), they would resemble conventional global feature selection problems (Guyon and Elisseeff, 2003). The problem in Eq. 26 determines the set of $k$ features with maximum predictive power, the problem in Eq. 25 determines the smallest possible set of features that achieve the performance represented by $t$, and the problem in Eq. 27 uses a parameter $\lambda$ to control the trade-off. Though not generally viewed as a model explanation approach, global feature selection serves an identical purpose of identifying highly predictive features.
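The selection-style summaries can be sketched in the same brute-force manner. The snippet below is an illustrative assumption rather than any method's actual solver: it enumerates every subset, which is only feasible for a handful of features, and `toy_u`, `lam`, `t` and `k` are hypothetical names and values.

```python
from itertools import combinations


def subsets(D):
    """Yield every subset of D as a frozenset, smallest first."""
    items = sorted(D)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def low_value_subset(u, D, lam):
    """MP-style: small set whose removal gives u a low value."""
    return min(subsets(D), key=lambda S: u(D - S) + lam * len(S))


def smallest_sufficient_subset(u, D, t):
    """MIR-style: smallest set with u(S) >= t, or None if infeasible."""
    feasible = [S for S in subsets(D) if u(S) >= t]
    return min(feasible, key=len) if feasible else None


def best_subset_of_size(u, D, k):
    """L2X/EP-style: highest-value set with exactly k features."""
    return max((S for S in subsets(D) if len(S) == k), key=u)


def penalized_subset(u, D, lam):
    """INVASE/FIDO-CA-style: trade off subset value against subset size."""
    return max(subsets(D), key=lambda S: u(S) - lam * len(S))


def toy_u(S):
    """Hypothetical set function: two additive terms plus a 0-2 interaction."""
    return 1.0 * (0 in S) + 2.0 * (1 in S) + 4.0 * (0 in S and 2 in S)


D = frozenset({0, 1, 2})
```

Passing a dataset-level set function instead of `toy_u` turns the last three helpers into exhaustive global feature selection, which is the correspondence noted above.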

We conclude by reiterating that the third dimension of our framework amounts to a choice of explanation mapping, which takes the form $E : \mathcal{U} \to \mathbb{R}^d$ for feature attribution or $E : \mathcal{U} \to \mathcal{P}(D)$ for feature selection. Our discussion so far has shown that removal-based explanations can be specified using three precise mathematical choices, as depicted in Figure 3. These methods, which are often presented in ways that make their connections difficult to discern, are constructed in a remarkably similar fashion.


Figure 3: Removal-based explanations are specified by three precise mathematical choices: a subset function $F$, a set function $u$, and an explanation mapping $E$ (for feature attribution or selection).

6.2 Complexity and approximations

Showing how certain explanation methods fit into our framework requires distinguishing between their substance and the approximations that make them practical. Our presentation of these methods deviates from the original papers, which often focus on details of a method’s implementation. We now bridge the gap by describing these methods’ significant computational complexity and the approximations they use out of necessity.

The challenge with most summarization techniques described above is that they require calculating the underlying set function’s value for many subsets of features. In fact, without making any simplifying assumptions about the model or data distribution, several techniques must examine all $2^d$ subsets of features. This includes the Shapley value, RISE’s summary technique and LIME’s linear model. Finding exact solutions to several of the optimization problems (MP, MIR, MM, INVASE, FIDO-CA) also requires examining all subsets of features, and solving the constrained optimization problem (EP, L2X) for $k$ features requires examining $\binom{d}{k}$ subsets, or $\mathcal{O}(2^d d^{-1/2})$ subsets in the worst case. (This can be seen by applying Stirling’s approximation to $\binom{d}{d/2}$ as $d$ becomes large.)

The only approaches with lower computational complexity are those that remove individual features (Occlusion, PredDiff, CXPlain, permutation tests, feature ablation) or include individual features (univariate predictors). These require only one subset per feature, or $d$ total feature subsets.
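The subset counts discussed here are easy to check numerically. The following sketch, with hypothetical helper names, counts the subsets an exhaustive search would examine and compares the largest binomial coefficient against its Stirling-based approximation $2^d \sqrt{2 / (\pi d)}$.

```python
from math import comb, pi, sqrt


def subsets_examined(d, k=None):
    """Subsets examined by exhaustive search: all 2^d, or C(d, k) of size k."""
    return 2 ** d if k is None else comb(d, k)


def stirling_central_binomial(d):
    """Stirling-based approximation of C(d, d/2) for even d."""
    return 2 ** d * sqrt(2 / (pi * d))


if __name__ == "__main__":
    for d in (10, 20, 40):
        exact = subsets_examined(d, d // 2)
        approx = stirling_central_binomial(d)
        print(d, subsets_examined(d), exact, round(approx / exact, 4))
```

The ratio between the approximation and the exact count approaches 1 as $d$ grows, so the worst-case size-constrained search is smaller than the full $2^d$ enumeration only by a factor on the order of $\sqrt{d}$.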

Many summarization techniques have superpolynomial complexity in $d$, making them intractable for large numbers of features. However, these methods work in practice due to fast approximation approaches, and in some cases methods have even been devised to generate explanations in real time. Strategies that yield fast approximations include:

  • Attribution values that are the expectation of a random variable can be estimated using Monte Carlo approximations. IME (Štrumbelj and Kononenko, 2010), Shapley Effects (Song et al., 2016) and SAGE (Covert et al., 2020) use sampling strategies to approximate Shapley values, and RISE also estimates its attributions via sampling (Petsiuk et al., 2018).

  • KernelSHAP and LIME are both based on linear regression models fitted to datasets containing an exponential number of datapoints. In practice, these techniques fit models to smaller sampled datasets, which means optimizing an approximate version of their objective function.

  • TreeSHAP calculates Shapley values in polynomial time using a dynamic programming algorithm that exploits the structure of tree-based models. Similarly, L-Shapley and C-Shapley exploit the properties of models for structured data to provide fast Shapley value approximations (Chen et al., 2018a).

  • Several of the feature selection methods (MP, L2X, EP, MM, FIDO-CA) solve continuous relaxations of their discrete optimization problems. While these optimization problems could be solved by representing the set of features as a mask $m \in \{0, 1\}^d$, these methods instead use a mask variable of the form $m \in [0, 1]^d$.

  • One feature selection method (MIR) uses a greedy optimization algorithm. MIR determines a set of influential features by iteratively removing groups of features that do not reduce the predicted probability for the correct class.

  • One feature attribution method (CXPlain) and three feature selection methods (L2X, INVASE, MM) generate real-time explanations by learning separate explainer models. CXPlain learns an explainer model using a dataset consisting of manually calculated explanations, which removes the need to iterate over each feature after training. L2X learns a model that outputs a set of features (represented by a $k$-hot vector) and INVASE learns a similar selector model that can output an arbitrary number of features; similarly, MM learns a model that outputs masks of the form $m \in [0, 1]^d$ for images. These techniques can be viewed as amortized optimization approaches (Shu, 2017) because they learn models that output approximate solutions in a single forward pass (similar to amortized inference; Kingma and Welling, 2013).
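The Monte Carlo strategy from the first item above can be sketched with permutation sampling, one common way to estimate Shapley values. This is a generic illustration under assumed names (`shapley_permutation_estimate`, `toy_u`), not the exact sampler used by IME, Shapley Effects or SAGE.

```python
import random


def shapley_permutation_estimate(u, D, num_permutations, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled feature orderings."""
    rng = random.Random(seed)
    features = sorted(D)
    totals = {i: 0.0 for i in features}
    for _ in range(num_permutations):
        order = features[:]
        rng.shuffle(order)
        S = frozenset()
        previous = u(S)
        for i in order:
            # Add feature i to the growing subset and record its marginal gain.
            S = S | {i}
            current = u(S)
            totals[i] += current - previous
            previous = current
    return {i: total / num_permutations for i, total in totals.items()}


def toy_u(S):
    """Hypothetical set function: two additive terms plus a 0-2 interaction."""
    return 1.0 * (0 in S) + 2.0 * (1 in S) + 4.0 * (0 in S and 2 in S)


if __name__ == "__main__":
    print(shapley_permutation_estimate(toy_u, frozenset({0, 1, 2}), 2000))
```

Each sampled ordering costs one set function evaluation per feature, so the total cost grows linearly in the number of permutations rather than exponentially in the number of features. The estimates always sum to the difference between the value of the full set and the value of the empty set, because the marginal gains telescope within every ordering.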

In conclusion, many methods provide efficient explanations despite using summarization techniques that are inherently intractable. Each approximation significantly speeds up computation relative to a brute-force calculation, but we predict that more approaches could be made to run in real-time by learning explainer models, as in the MM, L2X, INVASE and CXPlain approaches (Chen et al., 2018b; Dabkowski and Gal, 2017; Schwab and Karlen, 2019; Yoon et al., 2018).

7 Discussion

In this work, we developed a unified framework that characterizes a significant portion of the model explanation literature (25 existing methods). Removal-based explanations have a great degree of flexibility, and we systematized their differences by showing that each method is specified by three precise mathematical choices:

  1. How the method removes features. Each method specifies a subset function $F \in \mathfrak{F}$ to make predictions with subsets of features, often based on an existing model $f \in \mathcal{F}$.

  2. What model behavior the method analyzes. Each method implicitly relies on a set function $u : \mathcal{P}(D) \to \mathbb{R}$ to represent the model’s dependence on different groups of features. The set function describes the model’s behavior either for an individual prediction or across the entire dataset.

  3. How the method summarizes each feature’s influence. Methods generate explanations that provide a concise summary of each feature’s contribution to the set function $u$. Mappings of the form $E : \mathcal{U} \to \mathbb{R}^d$ generate feature attribution explanations, and mappings of the form $E : \mathcal{U} \to \mathcal{P}(D)$ generate feature selection explanations.

The growing interest in black-box ML models has spurred a remarkable amount of model explanation research, and in the past decade we have seen a number of publications proposing innovative new methods. However, as the field has matured we have also seen a growing number of unifying theories that reveal underlying similarities and implicit relationships (Ancona et al., 2017; Covert et al., 2020; Lundberg and Lee, 2017). Our framework for removal-based explanations is perhaps the broadest unifying theory yet, and it bridges the gap between disparate parts of the explainability literature.

An improved understanding of the field presents new opportunities for both explainability users and researchers. For users, we hope that our framework will allow for more explicit reasoning about the trade-offs between available explanation tools. The unique advantages of different methods are difficult to understand when they are viewed as monolithic algorithms, but disentangling their choices makes it simpler to reason about their strengths and weaknesses.

For researchers, our framework offers several promising directions for future work. We identify three key areas that can be explored to better understand the trade-offs between different removal-based explanations:

  • Several of the methods characterized by our framework can be interpreted using ideas from information theory (Chen et al., 2018b; Covert et al., 2020). We suspect that other methods can be understood with an information-theoretic perspective and that this may shed light on whether there are theoretically justified choices for each dimension of our framework.

  • As we showed in Section 5, every removal-based explanation is based on an underlying set function that represents the model’s behavior. Set functions can be viewed as cooperative games, and we suspect that methods besides those that use Shapley values (Covert et al., 2020; Datta et al., 2016; Lundberg and Lee, 2017; Owen, 2014; Štrumbelj et al., 2009) can be related to techniques from cooperative game theory.

  • Finally, it is remarkable that so many researchers have developed, with some degree of independence, explanation methods based on the same feature removal principle. We speculate that cognitive psychology may shed light on why this represents a natural approach to explaining complex decision processes. This would be impactful for the field because, as recent work has pointed out, explainability research is surprisingly disconnected from the social sciences (Miller, 2019; Miller et al., 2017).

In conclusion, as the field evolves and the number of removal-based explanations continues to grow, we hope that our framework can serve as a foundation upon which future research can build.

Appendix A Method Details

Here, we provide additional details about some of the explanation methods discussed in the main text. In several cases, we presented generalized versions of methods that deviated from their explanations in the original papers.

A.1 Meaningful Perturbations (MP)

Meaningful Perturbations (Fong and Vedaldi, 2017) considers multiple ways of deleting information from an input image, and the approach it recommends is a blurring operation. Given a mask $m \in [0, 1]^d$, MP uses a function $\Phi(x, m)$ to denote the modified input and suggests that the mask may be used to 1) set pixels to a constant value, 2) replace them with Gaussian noise, or 3) blur the image. In the blurring approach, each pixel $x_i$ is blurred separately using a Gaussian kernel with standard deviation given by $\sigma \cdot m_i$ (for a user-specified $\sigma > 0$).

To prevent adversarial solutions, MP incorporates a total variation norm on the mask, upsamples it from a low-resolution version, and uses a random jitter on the image during optimization. Additionally, MP uses a continuous mask $m \in [0, 1]^d$ in place of a binary mask and the $\ell_1$ penalty on the mask in place of the $\ell_0$ penalty. Although MP’s optimization tricks are key to providing visually compelling explanations, our presentation focuses on the most essential part of the optimization objective, which is reducing the classification probability while blurring only a small part of the image (Eq. 24).

A.2 Extremal Perturbations (EP)

Extremal Perturbations (Fong et al., 2019) is an extension of MP with several modifications. The first is switching the objective from a “removal game” to a “preservation game,” which means learning a mask that retains rather than removes the salient information. The second is replacing the penalty on the subset size (or the mask norm) with a constraint. In practice, the constraint is enforced using a penalty, but the authors argue that it should still be viewed as a constraint due to the use of a large regularization parameter.

EP uses the same blurring operation as MP and introduces new tricks to ensure a smooth mask, but our presentation focuses on the most important part of the optimization problem, which is maximizing the classification probability while blurring a fixed portion of the image (Eq. 26).

A.3 FIDO-CA

FIDO-CA (Chang et al., 2018) is similar to EP, but it replaces the blurring operation with features drawn from a generative model. The generative model can condition on arbitrary subsets of features, and although its samples are non-deterministic, FIDO-CA achieves strong results using a single sample. The authors consider multiple generative models but recommend a generative adversarial network (GAN) that uses contextual attention (Yu et al., 2018). The optimization objective is based on the same “preservation game” as EP, and the authors use the Concrete reparameterization trick (Maddison et al., 2016) for optimization.

A.4 Minimal Image Representation (MIR)

The Minimal Image Representation approach (Zhou et al., 2014) removes information from an image to determine which regions are salient for the desired class. MIR works by creating a segmentation of edges and regions and iteratively removing segments from the image (selecting those that least decrease the classification probability) until the remaining image is incorrectly classified. We view this as a greedy approach for solving the constrained optimization problem

$$\min_{S} \; |S| \quad \text{s.t.} \quad u(S) \geq t,$$

where $u(S)$ represents the prediction with the specified subset of features and $t$ represents the minimum allowable classification probability. Our presentation of MIR in the main text focuses on this view of the optimization objective rather than the specific greedy algorithm MIR uses (Eq. 25).

A.5 Masking Model (MM)

The Masking Model approach (Dabkowski and Gal, 2017) observes that removing salient information (while preserving irrelevant information) and removing irrelevant information (while preserving salient information) are both reasonable approaches to understanding image classifiers. The authors refer to these tasks as discovering the smallest destroying region (SDR) and smallest sufficient region (SSR).

The authors adopt notation similar to MP (Fong and Vedaldi, 2017), using $\Phi(x, m)$ to denote the transformation to the input given a mask $m \in [0, 1]^d$. For an input $x$, the authors aim to solve the following optimization problem:

$$\min_{m} \; \lambda_1 \mathrm{TV}(m) + \lambda_2 \|m\|_1 - \log f\big(\Phi(x, m)\big) + \lambda_3 f\big(\Phi(x, 1 - m)\big)^{\lambda_4}.$$

The TV (total variation) and $\ell_1$ penalty terms are both similar to MP and respectively encourage smoothness and sparsity in the mask. Unlike MP, MM learns a global explainer model that outputs approximate solutions to this problem in a single forward pass. In the main text, we provide a simplified presentation of the problem that does not include the logarithm in the third term or the exponent in the fourth term (Eq. 28). We view these as monotonic link functions that provide a more complex trade-off between the objectives but that are not necessary for finding informative solutions.

A.6 Learning to Explain (L2X)

The first theorem of the L2X paper (Chen et al., 2018b) says that the explanation they seek is the distribution $p(S \mid X)$ that optimizes the following objective:

$$\max_{p(S \mid X)} \; \mathbb{E}_{X, Y, S}\big[\log p(Y \mid X_S)\big] \quad \text{s.t.} \quad |S| = k.$$

If we replace the conditional probability $p(Y \mid X_S)$ with a subset function and allow for loss functions other than cross entropy, then we recover the version of this problem that we present in the main text. The L2X paper focuses on classification problems and an interpretation of their approach in terms of mutual information maximization; for a regression task evaluated with MSE loss, the approach could be interpreted analogously as performing conditional variance minimization.

A.7 Instance-wise Variable Selection (INVASE)

The INVASE method (Yoon et al., 2018) is very similar to L2X, but it parameterizes the selector model differently. Rather than constraining the explanations to contain exactly $k$ features, INVASE generates a set of features from a factorized Bernoulli distribution conditioned on the input $x$, using a regularization parameter $\lambda$ to control the trade-off between the number of features and the expected value of the loss function. Instead of optimizing the selector model with reparameterization gradients, INVASE is learned using an actor-critic approach.

A.8 Prediction Difference Analysis (PredDiff)

Prediction Difference Analysis (Zintgraf et al., 2017) removes individual features (or groups of features) and analyzes the difference in a model’s prediction. Removed pixels are imputed by conditioning on their bordering pixels, which approximates sampling from the full conditional distribution. Rather than measuring the prediction difference directly, the authors use attribution scores based on the log-odds ratio:

$$a_i = \log \frac{F(x)}{1 - F(x)} - \log \frac{F(x_{D \setminus \{i\}})}{1 - F(x_{D \setminus \{i\}})}.$$

We view this as another way of analyzing the difference in the model output for an individual prediction.

A.9 Causal Explanations (CXPlain)

CXPlain removes single features (or groups of features) for individual inputs and measures the change in the loss function (Schwab and Karlen, 2019). The authors propose calculating the attribution values

$$a_i = \ell\big(F(x_{D \setminus \{i\}}), y\big) - \ell\big(F(x), y\big)$$

and then computing the normalized values

$$\bar{a}_i = \frac{a_i}{\sum_{j=1}^{d} a_j}.$$

The normalization step enables the use of a learning objective based on Kullback-Leibler divergence for the explainer model, which is ultimately used to calculate attribution values in a single forward pass. The authors explain that this approach is based on a “causal objective,” but CXPlain is causal in the same sense as every other method described in our work.

A.10 Randomized Input Sampling for Explanation (RISE)

The RISE method (Petsiuk et al., 2018) begins by generating a large number of randomly sampled binary masks. In practice, the masks are sampled by dropping features from a low-resolution mask independently with probability $p$, upsampling to get an image-sized mask, and then applying a random jitter. Due to the upsampling, the masks have values $m \in [0, 1]^d$ rather than $m \in \{0, 1\}^d$.

The mask generation process induces a distribution over the masks, which we denote as $p(m)$. The method then uses the randomly generated masks to obtain a Monte Carlo estimate of the following attribution values:

$$a_i = \frac{1}{\mathbb{E}[M_i]} \, \mathbb{E}_{p(M)}\big[f(x \odot M) \cdot M_i\big].$$

If we ignore the upsampling step that creates continuous mask values, we see that these attribution values are the mean prediction when a given pixel is included:

$$a_i = \mathbb{E}_{p(M \mid M_i = 1)}\big[f(x \odot M)\big].$$
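The claim that, for binary masks, the mask-weighted average reduces to the mean prediction over masks that keep a given feature can be checked by exact enumeration. The sketch below is an assumption-laden illustration: the model `toy_f`, the input and the helper names are hypothetical, masks are binary with independent drop probability `p_drop`, and removed features are simply zeroed out.

```python
from itertools import product


def mask_probability(m, p_drop):
    """Probability of a binary mask when features are dropped independently."""
    prob = 1.0
    for m_j in m:
        prob *= (1.0 - p_drop) if m_j == 1 else p_drop
    return prob


def apply_mask(x, m):
    """Zero out the features whose mask entry is 0."""
    return [x_j * m_j for x_j, m_j in zip(x, m)]


def rise_weighted_form(f, x, i, p_drop):
    """E[f(x * M) * M_i] / E[M_i], enumerating all binary masks."""
    numerator = denominator = 0.0
    for m in product((0, 1), repeat=len(x)):
        prob = mask_probability(m, p_drop)
        numerator += prob * f(apply_mask(x, m)) * m[i]
        denominator += prob * m[i]
    return numerator / denominator


def rise_conditional_form(f, x, i, p_drop):
    """E[f(x * M) | M_i = 1]: average over masks that keep feature i."""
    kept = [m for m in product((0, 1), repeat=len(x)) if m[i] == 1]
    mass = sum(mask_probability(m, p_drop) for m in kept)
    total = sum(mask_probability(m, p_drop) * f(apply_mask(x, m)) for m in kept)
    return total / mass


def toy_f(z):
    """Hypothetical model with one additive term and one interaction."""
    return z[0] + 2.0 * z[1] * z[2]
```

The two forms agree because the product with the mask entry is nonzero only for masks that keep feature $i$, and dividing by the expected mask entry renormalizes over exactly those masks.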

A.11 Interactions-based Method for Explanation (IME)

IME was presented in two separate papers (Štrumbelj et al., 2009; Štrumbelj and Kononenko, 2010). In the original version, the authors recommended training a separate model for each subset of features. In the second version, the authors proposed the more efficient approach of marginalizing out the removed features from a single model $f$.

The latter paper is ambiguous about the specific distribution used to marginalize out held-out features (Štrumbelj and Kononenko, 2010). Lundberg and Lee (2017) take the view that features are marginalized out using their distribution from the training dataset (i.e., the marginal distribution). In contrast, Merrick and Taly (2019) view IME as marginalizing out features using a uniform distribution. Upon a close reading of the paper, we opt for the uniform interpretation, but the specific interpretation of IME’s choice of distribution does not impact any of our conclusions.

A.12 TreeSHAP

TreeSHAP uses a unique approach to handle held out features in tree-based models (Lundberg et al., 2020). It accounts for missing features using the distribution induced by the underlying trees, and, since it exhibits no dependence on the held out features, it is a valid extension of the original model. However, it cannot be viewed as marginalizing out features using a simple distribution.

Given a subset of features, TreeSHAP makes a prediction separately for each tree and then combines each tree’s prediction in the standard fashion. But when a split for an unknown feature is encountered, TreeSHAP averages predictions over the multiple paths in proportion to how often the dataset follows each path. This is similar but not identical to the conditional distribution because each time this averaging step is performed, TreeSHAP conditions only on coarse information about the features that preceded the split.
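The averaging rule described above can be illustrated for a single tree. The sketch below assumes a hypothetical dictionary-based tree format in which each child node stores a `cover` count (the number of training examples that reached it); it shows only the prediction rule for a given subset, not TreeSHAP's polynomial-time algorithm for computing Shapley values.

```python
def tree_predict_with_missing(node, x, S):
    """Predict with one tree when only the features in S are known.

    Splits on known features are followed as usual; splits on held-out
    features average both branches, weighted by how many training
    examples went down each branch (the assumed `cover` field).
    """
    if "value" in node:
        return node["value"]
    left, right = node["left"], node["right"]
    if node["feature"] in S:
        child = left if x[node["feature"]] <= node["threshold"] else right
        return tree_predict_with_missing(child, x, S)
    total_cover = left["cover"] + right["cover"]
    return (
        left["cover"] * tree_predict_with_missing(left, x, S)
        + right["cover"] * tree_predict_with_missing(right, x, S)
    ) / total_cover


# Hypothetical two-split tree (assumption, for illustration only).
TREE = {
    "feature": 0, "threshold": 0.5,
    "left": {"value": 1.0, "cover": 30},
    "right": {
        "feature": 1, "threshold": 2.0, "cover": 70,
        "left": {"value": 2.0, "cover": 20},
        "right": {"value": 6.0, "cover": 50},
    },
}
```

For the input `[1.0, 3.0]`, holding out feature 1 averages the two leaves under the right branch, whereas holding out feature 0 averages the left leaf with the prediction obtained by following feature 1 down the right subtree. In both cases the weights come only from the counts at the split, which is why the result conditions only on coarse information about the preceding features.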

A.13 Shapley Net Effects

Shapley Net Effects was originally proposed for linear models that use MSE loss, but we generalize the method to arbitrary model classes and arbitrary loss functions. Unfortunately, Shapley Net Effects quickly becomes impractical with large numbers of features or non-linear models.

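A.14 Shapley Effects

Shapley Effects analyzes a variance-based measure of a function’s sensitivity to its inputs, with the goal of discovering which features are responsible for the greatest variance reduction in the model output (Owen, 2014). The cooperative game described in the paper is:

$$u(S) = \mathrm{Var}\big(\mathbb{E}[f(X) \mid X_S]\big).$$

We present a generalized version to cast this method in our framework. In the appendix of Covert et al. (2020), it was shown that this game is equal to:

$$\mathrm{Var}\big(\mathbb{E}[f(X) \mid X_S]\big) = c - \mathbb{E}_{X}\big[\ell\big(\mathbb{E}[f(X) \mid X_S], f(X)\big)\big],$$

where $c$ is a constant that does not depend on $S$. This derivation assumes that the loss function $\ell$ is MSE and that the subset function is $F(x_S) = \mathbb{E}[f(X) \mid X_S = x_S]$. Rather than the original formulation, we present a cooperative game that is equivalent up to a constant value and that provides flexibility in the choice of loss function:

$$w(S) = -\mathbb{E}_{X}\big[\ell\big(F(X_S), F(X)\big)\big],$$

where $F(X) = f(X)$ is the full model output.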
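The equivalence with the original variance-based game follows from the law of total variance. As a brief sketch, assuming MSE loss $\ell(a, b) = (a - b)^2$ and writing $F(X_S) = \mathbb{E}[f(X) \mid X_S]$:

$$
\begin{aligned}
\mathrm{Var}\big(f(X)\big) &= \mathrm{Var}\big(\mathbb{E}[f(X) \mid X_S]\big) + \mathbb{E}\big[\mathrm{Var}\big(f(X) \mid X_S\big)\big], \\
\mathbb{E}\big[\mathrm{Var}\big(f(X) \mid X_S\big)\big] &= \mathbb{E}_{X}\big[\big(f(X) - F(X_S)\big)^2\big] = \mathbb{E}_{X}\big[\ell\big(F(X_S), f(X)\big)\big].
\end{aligned}
$$

Rearranging the first line and substituting the second gives $\mathrm{Var}(\mathbb{E}[f(X) \mid X_S]) = \mathrm{Var}(f(X)) - \mathbb{E}_{X}[\ell(F(X_S), f(X))]$. Since $\mathrm{Var}(f(X))$ does not depend on $S$, the two games differ only by a constant.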


  • [1] K. Aas, M. Jullum, and A. Løland (2019) Explaining individual predictions when features are dependent: more accurate approximations to Shapley values. arXiv preprint arXiv:1903.10464. Cited by: §2.2, 7th item.
  • [2] A. Adadi and M. Berrada (2018)

    Peeking inside the black-box: a survey on explainable artificial intelligence (xai)

    IEEE Access 6, pp. 52138–52160. Cited by: §2.2.
  • [3] C. Agarwal and A. Nguyen (2019) Explaining an image classifier’s decisions using generative models. arXiv preprint arXiv:1910.04256. Cited by: §2.2.
  • [4] M. Ancona, E. Ceolini, C. Öztireli, and M. Gross (2017) Towards better understanding of gradient-based attribution methods for deep neural networks. arXiv preprint arXiv:1711.06104. Cited by: §2.2, §7.
  • [5] S. Bach, A. Binder, G. Montavon, F. Klauschen, K. Müller, and W. Samek (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS One 10 (7), pp. e0130140. Cited by: §2.2.
  • [6] L. Breiman (2001) Random forests. Machine Learning 45 (1), pp. 5–32. Cited by: §1, §2.1, §2.2, §3.1, §3.2, 8th item.
  • [7] C. Chang, E. Creager, A. Goldenberg, and D. Duvenaud (2018) Explaining image classifiers by counterfactual generation. arXiv preprint arXiv:1807.08024. Cited by: §A.3, §2.2, 6th item.
  • [8] J. Chen, L. Song, M. J. Wainwright, and M. I. Jordan (2018) L-Shapley and C-Shapley: efficient model interpretation for structured data. arXiv preprint arXiv:1808.02610. Cited by: 3rd item.
  • [9] J. Chen, L. Song, M. J. Wainwright, and M. I. Jordan (2018) Learning to explain: an information-theoretic perspective on model interpretation. arXiv preprint arXiv:1802.07814. Cited by: §A.6, §1, §2.1, §3.1, §3.2, 3rd item, §6.2, 1st item.
  • [10] I. Covert, S. Lundberg, and S. Lee (2020) Understanding global feature contributions through additive importance measures. arXiv preprint arXiv:2004.00668. Cited by: §A.14, §1, §2.1, §2.2, §2.2, §3.2, 7th item, 4th item, 5th item, §5, 1st item, 1st item, 2nd item, §7.
  • [11] P. Dabkowski and Y. Gal (2017) Real time image saliency for black box classifiers. In Advances in Neural Information Processing Systems, pp. 6967–6976. Cited by: §A.5, §2.1, 2nd item, §6.2.
  • [12] A. Datta, S. Sen, and Y. Zick (2016) Algorithmic transparency via quantitative input influence: theory and experiments with learning systems. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 598–617. Cited by: §2.2, §2.2, 9th item, 2nd item.
  • [13] R. C. Fong and A. Vedaldi (2017) Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3429–3437. Cited by: §A.1, §A.5, §2.1, §2.1, §3.1, §3.2, 5th item.
  • [14] R. Fong, M. Patrick, and A. Vedaldi (2019) Understanding deep networks via extremal perturbations and smooth masks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2950–2958. Cited by: §A.2, §3.2, 5th item.
  • [15] C. Frye, D. de Mijolla, L. Cowton, M. Stanley, and I. Feige (2020) Shapley-based explainability on the data manifold. arXiv preprint arXiv:2006.01272. Cited by: §2.2, 7th item, 4th item.
  • [16] C. Frye, I. Feige, and C. Rowat (2019) Asymmetric Shapley values: incorporating causal knowledge into model-agnostic explainability. arXiv preprint arXiv:1910.06358. Cited by: §2.2.
  • [17] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi (2018) A survey of methods for explaining black box models. ACM Computing Surveys (CSUR) 51 (5), pp. 1–42. Cited by: §2.2.
  • [18] I. Guyon and A. Elisseeff (2003) An introduction to variable and feature selection. Journal of Machine Learning Research 3 (Mar), pp. 1157–1182. Cited by: 13th item, §6.1.
  • [19] G. Hooker and L. Mentch (2019) Please stop permuting features: an explanation and alternatives. arXiv preprint arXiv:1905.03151. Cited by: §2.2.
  • [20] D. Janzing, L. Minorics, and P. Blöbaum (2019) Feature relevance quantification in explainable AI: a causality problem. arXiv preprint arXiv:1910.13413. Cited by: §2.2, 8th item.
  • [21] D. P. Kingma and M. Welling (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. Cited by: 6th item.
  • [22] J. Lei, M. G’Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman (2018) Distribution-free predictive inference for regression. Journal of the American Statistical Association 113 (523), pp. 1094–1111. Cited by: 13th item.
  • [23] S. Lipovetsky and M. Conklin (2001) Analysis of regression in game theory approach. Applied Stochastic Models in Business and Industry 17 (4), pp. 319–330. Cited by: §1, §2.2, 13th item.
  • [24] Z. C. Lipton (2018) The mythos of model interpretability. Queue 16 (3), pp. 31–57. Cited by: §1.
  • [25] S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774. Cited by: §A.11, §1, §1, §2.1, §2.1, §2.1, §2.2, §2.2, §3.1, §3.2, 7th item, 8th item, 2nd item, §7.
  • [26] S. M. Lundberg, G. Erion, H. Chen, A. DeGrave, J. M. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, and S. Lee (2020) From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence 2 (1), pp. 56–67. Cited by: §A.12, §2.2, 12th item, 7th item.
  • [27] C. J. Maddison, A. Mnih, and Y. W. Teh (2016) The concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712. Cited by: §A.3.
  • [28] L. Merrick and A. Taly (2019) The explanation game: explaining machine learning models with cooperative game theory. arXiv preprint arXiv:1909.08128. Cited by: §A.11, §2.2.
  • [29] T. Miller, P. Howe, and L. Sonenberg (2017) Explainable AI: beware of inmates running the asylum or: how I learnt to stop worrying and love the social and behavioural sciences. arXiv preprint arXiv:1712.00547. Cited by: §1, 3rd item.
  • [30] T. Miller (2019) Explanation in artificial intelligence: insights from the social sciences. Artificial Intelligence 267, pp. 1–38. Cited by: §1, 3rd item.
  • [31] A. B. Owen (2014) Sobol’ indices and Shapley value. SIAM/ASA Journal on Uncertainty Quantification 2 (1), pp. 245–251. Cited by: §A.14, §1, §2.1, §3.1, §3.2, 7th item, 2nd item.
  • [32] V. Petsiuk, A. Das, and K. Saenko (2018) RISE: randomized input sampling for explanation of black-box models. arXiv preprint arXiv:1806.07421. Cited by: §A.10, §1, §1, §3.2, 1st item, 4th item, 1st item.
  • [33] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. Cited by: §1, §2.1, §2.1, §2.2, §3.1, §3.1, 2nd item.
  • [34] C. Rudin (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1 (5), pp. 206–215. Cited by: §2.1.
  • [35] P. Schwab and W. Karlen (2019) CXPlain: causal explanations for model interpretation under uncertainty. In Advances in Neural Information Processing Systems, pp. 10220–10230. Cited by: §A.9, 1st item, §6.2.
  • [36] L. S. Shapley (1953) A value for n-person games. Contributions to the Theory of Games 2 (28), pp. 307–317. Cited by: §3.2, 5th item.
  • [37] A. Shrikumar, P. Greenside, A. Shcherbina, and A. Kundaje (2016) Not just a black box: learning important features through propagating activation differences. arXiv preprint arXiv:1605.01713. Cited by: §2.2.
  • [38] R. Shu (2017) Amortized optimization. Cited by: 6th item.
  • [39] K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. Cited by: §2.1, §2.1.
  • [40] D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg (2017) SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825. Cited by: §2.1.
  • [41] E. Song, B. L. Nelson, and J. Staum (2016) Shapley effects for global sensitivity analysis: theory and computation. SIAM/ASA Journal on Uncertainty Quantification 4 (1), pp. 1060–1083. Cited by: 1st item.
  • [42] C. Strobl, A. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis (2008) Conditional variable importance for random forests. BMC Bioinformatics 9 (1), pp. 307. Cited by: §1, 7th item.
  • [43] E. Štrumbelj, I. Kononenko, and M. R. Šikonja (2009) Explaining instance classifications with interactions of subsets of feature values. Data & Knowledge Engineering 68 (10), pp. 886–904. Cited by: §A.11, §1, §1, §2.2, 13th item, 2nd item.
  • [44] E. Štrumbelj and I. Kononenko (2010) An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research 11, pp. 1–18. Cited by: §A.11, §A.11, §3.2, 10th item, 1st item.
  • [45] M. Sundararajan and A. Najmi (2019) The many Shapley values for model explanation. arXiv preprint arXiv:1908.08474. Cited by: §2.2.
  • [46] M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3319–3328. Cited by: §1, §2.1, §2.1, §2.1, §2.2.
  • [47] R. Tibshirani (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288. Cited by: 3rd item.
  • [48] S. Xu, S. Venugopalan, and M. Sundararajan (2020) Attribution in scale and space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9680–9689. Cited by: §2.1.
  • [49] J. Yoon, J. Jordon, and M. van der Schaar (2018) INVASE: instance-wise variable selection using neural networks. In International Conference on Learning Representations. Cited by: §A.7, §3.2, 3rd item, §6.2.
  • [50] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2018) Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5505–5514. Cited by: §A.3, 6th item.
  • [51] M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833. Cited by: §1, §1, §2.1, §2.1, §3.2, 1st item.
  • [52] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2014) Object detectors emerge in deep scene CNNs. arXiv preprint arXiv:1412.6856. Cited by: §A.4, §2.1, §3.2, 4th item.
  • [53] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929. Cited by: §2.1.
  • [54] L. M. Zintgraf, T. S. Cohen, T. Adel, and M. Welling (2017) Visualizing deep neural network decisions: prediction difference analysis. arXiv preprint arXiv:1702.04595. Cited by: §A.8, 7th item.