Sampling, Intervention, Prediction, Aggregation: A Generalized Framework for Model Agnostic Interpretations

04/08/2019 ∙ by Christian A. Scholbeck, et al. ∙ Universität München

Non-linear machine learning models often achieve great predictive performance at the cost of interpretability. However, model agnostic interpretation techniques now allow us to estimate the effect and importance of features for any predictive model. Differing notations and terminology have complicated their understanding and obscured how they are related. A unified view on these methods has been missing. We present the generalized SIPA (Sampling, Intervention, Prediction, Aggregation) framework of work stages for model agnostic interpretation techniques and demonstrate how several prominent methods for feature effects can be embedded into the proposed framework. We also formally introduce marginal effects, a pre-existing statistical concept, to describe feature effects for black box models. Furthermore, we extend the framework to feature importance computations by pointing out how variance-based and performance-based importance measures are based on the same work stages. The generalized framework may serve as a guideline to conduct model agnostic interpretations in machine learning.


1 Introduction and Related Work

There has been an ongoing debate about the lack of interpretability of machine learning (ML) models. As a result, researchers have put great effort into developing techniques that create insights into the workings of predictive black box models. The novel field of interpretable machine learning (IML) [19] serves as an umbrella term for all interpretation methods in machine learning research. These methods may be distinguished in several ways:

  1. Feature effects or feature importance: Predictive models are interpreted with two main goals in mind. First, feature effects indicate the direction and magnitude of change in the predicted outcome when a feature value changes. Prominent methods include the individual conditional expectation (ICE) [10] & partial dependence (PD) [9], accumulated local effects (ALE) [1] and Shapley values [23]. Second, the feature importance is defined as the contribution of a feature to the predictive performance of the fitted model. This includes variance-based measures like the feature importance ranking measure (FIRM) [11, 24] and performance-based measures like the permutation feature importance (PFI) [8], individual conditional importance (ICI) and partial importance (PI) curves [4], as well as the Shapley feature importance (SFIMP) [4]. Input gradients were proposed by [12] as a model agnostic tool for both effects and importance that essentially equals marginal effects (ME) [16], which have a long tradition in statistics. They also define an average input gradient which corresponds to the average marginal effect (AME).

  2. Intrinsic or post hoc interpretability: Linear models (LM), generalized linear models (GLM), classification and regression trees (CART) or rule lists [21] are examples of intrinsically interpretable models, while random forests (RF), support vector machines (SVM), neural networks (NN) or gradient boosting (GB) models can only be interpreted post hoc. Here, the interpretation process is detached from and takes place after the model fitting process, e.g. with ICE & PD plots or ALEs.

  3. Model specific or model agnostic interpretations: Interpreting model coefficients of GLMs or deriving a decision rule from a classification tree is a model specific interpretation. Model agnostic methods like ICE & PD plots or ALEs can be applied to any model and are suited for model comparisons.

  4. Local or global explanations: Local explanations like ICE curves evaluate the model behavior when predicting the target variable for one specific observation. Global explanations like the PD curve interpret the model for the entire input space. Furthermore, it is possible to explain model predictions for a group of observations, e.g. on intervals. In many cases, local and global explanations can be transformed into one another via (dis-)aggregation, e.g. with ICE & PD curves.

Motivation: Different terminologies and the variety of available tools complicate the understanding, discussion and research of interpretation techniques in ML. Deconstructing model agnostic methods into sequential work stages reveals striking similarities. This might not be surprising, considering that the motive for using any model agnostic interpretation technique is to explain the predictions of black box models. Directly inspecting the inner workings of complex, non-linear prediction functions is usually infeasible. Instead, all model agnostic techniques work according to the same principle, i.e. they evaluate the predictions of the model when its inputs change. However, a unified view on these methods has been missing so far. The motivation for this research paper is to establish a common terminology for model agnostic interpretation methods and to reveal the similarities in their computation.

Contributions: In section 3, we formally introduce marginal effects (ME) from statistics as a model agnostic interpretation tool for ML. In section 4, we present the generalized SIPA (Sampling, Intervention, Prediction, Aggregation) framework of work stages for model agnostic techniques as our main contribution. We proceed to demonstrate how several methods to estimate feature effects (MEs, ICE & PD, ALEs and the Shapley value) can be embedded into the proposed framework. Furthermore, in sections 5 and 6 we extend the framework to feature importance by pointing out how variance-based (FIRM) and performance-based (ICI & PI, PFI and SFIMP) importance measures are based on the same work stages. The SIPA framework reduces the diverse set of available techniques to a single methodology. It may therefore establish a common ground to discuss model agnostic interpretations in terms of notation and terminology and inspire the development of novel methods in IML.

2 Notation and Preliminaries

Assume a p-dimensional feature space $\mathcal{X}_P = \mathcal{X}_1 \times \dots \times \mathcal{X}_p$ with index set $P = \{1, \dots, p\}$ and a target space $\mathcal{Y}$. We assume an unknown functional relationship $f$ between $\mathcal{X}_P$ and $\mathcal{Y}$ plus a random error $\epsilon$, i.e. $y = f(x) + \epsilon$. A supervised learning model $\hat{f}$ learns relationships from an i.i.d. training sample with $n$ observations that was drawn from the joint space $\mathcal{X}_P \times \mathcal{Y}$. The corresponding random variables from the feature space are denoted by $X = (X_1, \dots, X_p)$. The random variable from the target space is denoted by $Y$. The vector $x^{(i)} = (x_1^{(i)}, \dots, x_p^{(i)})$ corresponds to the i-th observation that is associated with the observed target value $y^{(i)}$. The vector $x_j = (x_j^{(1)}, \dots, x_j^{(n)})$ represents the realized values of $X_j$. Every feature is assigned a unique index value $j$ with $j \in P$. We apply a variety of model agnostic interpretation techniques to a subset of features with index set $S$ ($S \subseteq P$). The complement $C$ denotes the unselected features ($C = P \setminus S$). If not denoted otherwise, $S$ contains exactly one element. The corresponding random variables are denoted by $X_S$ and $X_C$, and the observed feature values by $x_S$ and $x_C$, respectively. The generalization error $GE = \mathbb{E}[L(f(X), Y)]$ is measured by the loss function $L$ on unknown test data and estimated with a sample of test data $\mathcal{D}_{test}$:

$$\widehat{GE} = \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} L\left(\hat{f}(x^{(i)}), y^{(i)}\right)$$

The partial derivative of the trained model $\hat{f}$ with respect to $x_S$ at point $(x_S, x_C)$ is numerically approximated with a symmetric difference quotient [16]:

$$\frac{\partial \hat{f}(x_S, x_C)}{\partial x_S} \approx \frac{\hat{f}(x_S + h, x_C) - \hat{f}(x_S - h, x_C)}{2h}$$

A term of the form $\hat{f}(x_S + h, x_C) - \hat{f}(x_S - h, x_C)$ on the interval $[x_S - h, x_S + h]$ is called a finite difference (FD) with respect to $x_S$ at point $(x_S, x_C)$. The step size h needs to be as small as possible to approximate the differential quotient, but too small values of h will result in cancellation errors.
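As a minimal sketch of this approximation (our own illustration, not code from the paper; we assume a fitted model exposing a vectorized prediction function `predict` over a NumPy feature matrix):

```python
import numpy as np

def symmetric_fd_quotient(predict, X, j, h=1e-4):
    """Approximate the partial derivative of the prediction function with
    respect to feature j at every row of X via (f(x+h) - f(x-h)) / (2h)."""
    X_fwd, X_bwd = X.copy(), X.copy()
    X_fwd[:, j] += h  # forward step
    X_bwd[:, j] -= h  # backward step
    return (predict(X_fwd) - predict(X_bwd)) / (2 * h)
```

The step size h trades off approximation error against floating point cancellation, as noted above.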

The prediction function $\hat{f}$ can be decomposed into a sum of lower-dimensional terms. A functional decomposition of $\hat{f}$ is defined as:

$$\hat{f}(x_1, \dots, x_p) = g_0 + \sum_{j=1}^{p} g_j(x_j) + \sum_{j < k} g_{jk}(x_j, x_k) + \dots + g_{1 \dots p}(x_1, \dots, x_p)$$

There is a constant term $g_0$. The functions $g_j(x_j)$ can be regarded as main effects (first order effects), the functions $g_{jk}(x_j, x_k)$ as two-way interactions between the random variables $X_j$ and $X_k$ (second order effects), the functions $g_{jkl}(x_j, x_k, x_l)$ as three-way interactions between $X_j$, $X_k$ and $X_l$ (third order effects), continuing up to the p-th order. The functional decomposition of $\hat{f}$ is only unique if we assign additional constraints to its components, e.g. orthogonality constraints and zero means (see [1], [13], [14] or [22]).
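As a worked example (our own, ignoring the centering and orthogonality constraints for simplicity), a two-dimensional prediction function with a constant, one main effect and one interaction decomposes as:

```latex
\begin{aligned}
\hat{f}(x_1, x_2) &= 2 + x_1 + x_1 x_2 \\
 &= \underbrace{2}_{g_0}
  + \underbrace{x_1}_{g_1(x_1)}
  + \underbrace{0}_{g_2(x_2)}
  + \underbrace{x_1 x_2}_{g_{12}(x_1, x_2)}
\end{aligned}
```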

3 Feature Effects

Partial dependence (PD) & individual conditional expectation (ICE): First suggested by [9], the PD is defined as the dependence of the true model on one or multiple selected features after all remaining features have been marginalized out [10]:

$$PD_S(x_S) = \mathbb{E}_{X_C}\left[f(x_S, X_C)\right] = \int f(x_S, x_C) \; d\mathbb{P}(x_C) \quad (1)$$

Every subset of features S has its own PD function that returns the expected value of the target variable for given values of $x_S$ while the vector of complementary features $X_C$ varies over its marginal distribution $\mathbb{P}(x_C)$. Unfortunately, neither the true model nor the marginal distribution is usually known. We use the fitted model $\hat{f}$ as an approximation to the true model and the sampling distribution as a substitute for the marginal distribution. The PD is estimated via Monte Carlo integration:

$$\widehat{PD}_S(x_S) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}\left(x_S, x_C^{(i)}\right)$$

Its graphical visualization is called the partial dependence plot (PDP). [9] suggests using the PD as a feature effect measure when features do not interact. Otherwise it can obfuscate the relationships in the data [4]. [10] instead propose the individual conditional expectation (ICE). The i-th ICE curve corresponds to the expected value of the target for the i-th observation as a function of $x_S$, conditional on the observed vector $x_C^{(i)}$, i.e. $\hat{f}(x_S, x_C^{(i)})$.

ICE curves disaggregate the global effect estimates of the PD curve into local effect estimates for single observations. ICE and PD suffer from extrapolation when features are correlated, because the permuted feature value combinations used for the predictions are located in regions without any training data [1].
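A minimal sketch of the estimation (our own illustration; `predict`, the grid and all names are assumptions): for every grid value of $x_S$, we substitute the feature column, predict, and average point-wise.

```python
import numpy as np

def ice_and_pd(predict, X, j, grid):
    """Estimate ICE curves and the PD of feature j on a grid of values.
    Returns (ice, pd): ice has shape (n_obs, n_grid), pd has shape (n_grid,)."""
    ice = np.empty((X.shape[0], len(grid)))
    for g, value in enumerate(grid):
        X_int = X.copy()
        X_int[:, j] = value         # intervention: set x_S to the grid value
        ice[:, g] = predict(X_int)  # prediction: one ICE point per observation
    pd = ice.mean(axis=0)           # aggregation: point-wise average -> PD
    return ice, pd
```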

Accumulated local effects (ALE): [1] develops ALEs as a feature effect measure for correlated features that does not extrapolate. The idea of ALEs is to take the integral with respect to $x_S$ of the first derivative of the prediction function with respect to $x_S$. This creates an accumulated partial effect of $x_S$ on the target variable while simultaneously blocking out the effects of other features. The main advantage of not extrapolating stems from integrating with respect to the conditional distribution of $X_C$ given $X_S$ instead of the marginal distribution of $X_C$ [1]. Consider a single feature S. The minimum of the distribution of the random variable $X_S$ is denoted by $z_{min,S}$, and the minimum of the observed marginal distribution by $z_{0,S}$. The corresponding maximum values are denoted by $z_{max,S}$ and $z_{k,S}$, respectively. The true first order ALE is defined as:

$$ALE_S(x_S) = \int_{z_{min,S}}^{x_S} \mathbb{E}_{X_C \mid X_S = z_S}\left[\frac{\partial f(z_S, X_C)}{\partial z_S}\right] dz_S \;-\; c \quad (2)$$

A constant c is subtracted in order to center the plot. We estimate the first order ALE for the whole value range of $x_S$ in three steps:

  1. The continuous function of the partial derivative is generally unknown and has to be estimated interval-wise. It is approximated by computing finite differences (FD) on a suitable discretization of the observed value range into a set of interval boundaries $\{z_{0,S}, \dots, z_{k,S}\}$. The estimation at a point inside the interval $]z_{k-1,S}, z_{k,S}]$ represents the approximation within that interval. The local effect within the interval corresponds to the integral of the partial derivative from $z_{k-1,S}$ to $z_{k,S}$. As we will integrate the partial derivative with respect to the same variable, dividing FDs by interval widths is not necessary.

  2. The conditional expected value, i.e. the expectation with respect to the marginal distribution of $X_C$, conditional on $X_S$, is estimated via Monte Carlo integration within the preceding interval $]z_{k-1,S}, z_{k,S}]$. This replaces the inner integral in eq. (2).

  3. The accumulation of all estimated local effects up to $x_S$ replaces the outer integral in eq. (2), i.e. the estimated local effects within the intervals $]z_{0,S}, z_{1,S}], \dots, ]z_{k-1,S}, z_{k,S}]$ are summed up.

[1] suggests partitioning the value range of $x_S$ according to the quantiles of the sampling distribution. The number of quantiles is determined by a manually specified number of intervals. Regions with a large concentration of observations receive a finer partitioning than regions with a low concentration of observations. The partial derivative is approximated by first taking FDs of predictions within each interval. For the i-th observation within the interval $]z_{k-1,S}, z_{k,S}]$, the observed value $x_S^{(i)}$ is substituted by the right interval boundary $z_{k,S}$ and by the left one $z_{k-1,S}$. This corresponds to an FD of predictions with a forward step and a backward step: $\hat{f}(z_{k,S}, x_C^{(i)}) - \hat{f}(z_{k-1,S}, x_C^{(i)})$.

The second order ALE is the extension of the first order ALE to a bivariate $x_S$. It is important to note that first order effect estimates are subtracted from the second order estimates. [1] further lays out the computations necessary for higher order ALEs.

The important property of ALEs to block out the effects of other features stems from using FDs. [1] calls this additive unbiasedness, because additively linked effects of other features in the prediction function are blocked out. Consider a prediction function with a sole main effect of feature S, denoted by $g_S(x_S)$, and arbitrary additively linked effects of the other features, e.g. $\hat{f}(x) = g_S(x_S) + g_C(x_C)$. The FD of predictions with observed values $x_C^{(i)}$ within the interval $]z_{k-1,S}, z_{k,S}]$ results in the FD of $g_S$ only, which further leads to an estimated ALE of the main effect only:

$$\hat{f}(z_{k,S}, x_C^{(i)}) - \hat{f}(z_{k-1,S}, x_C^{(i)}) = g_S(z_{k,S}) - g_S(z_{k-1,S})$$
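The estimation procedure can be sketched as follows (our own illustration under the same assumptions as before; the quantile binning, the simple centering step and all names are our choices, not the paper's implementation):

```python
import numpy as np

def ale_first_order(predict, X, j, n_intervals=20):
    """Estimate the first order ALE of feature j on quantile-based intervals."""
    # discretize the observed value range by quantiles (denser where data is dense)
    z = np.quantile(X[:, j], np.linspace(0, 1, n_intervals + 1))
    local_effects = np.zeros(n_intervals)
    for k in range(n_intervals):
        in_k = (X[:, j] > z[k]) & (X[:, j] <= z[k + 1]) if k else (X[:, j] <= z[1])
        if not in_k.any():
            continue
        X_hi, X_lo = X[in_k].copy(), X[in_k].copy()
        X_hi[:, j] = z[k + 1]  # forward step to the right interval boundary
        X_lo[:, j] = z[k]      # backward step to the left interval boundary
        # interval-wise average FD: Monte Carlo estimate of the inner integral
        local_effects[k] = (predict(X_hi) - predict(X_lo)).mean()
    ale = np.cumsum(local_effects)  # accumulation: replaces the outer integral
    return z[1:], ale - ale.mean()  # simple centering in place of the constant c
```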

Marginal effects (ME): MEs are an established technique in statistics and commonly applied in econometrics [15]. They are often used to interpret non-linear functions of coefficients in GLMs, such as logistic regression, by approximating their first derivatives. We extend this concept to any differentiable prediction function. Although there is extensive literature on MEs, the concept was suggested by [12] as a novel method for ML and referred to as the input gradient. Consider a prediction function with a constant as well as first and second order effects of two features, e.g. $\hat{f}(x_1, x_2) = g_0 + g_1(x_1) + g_2(x_2) + g_{12}(x_1, x_2)$. Taking the FD of predictions in the numerator blocks out the constant and the additively linked main effect of the second feature in eq. (3):

$$ME_1 = \frac{\hat{f}(x_1 + h, x_2) - \hat{f}(x_1, x_2)}{h} = \frac{\left[g_1(x_1 + h) - g_1(x_1)\right] + \left[g_{12}(x_1 + h, x_2) - g_{12}(x_1, x_2)\right]}{h} \quad (3)$$

MEs are defined locally, and the feature space usually is multi-dimensional. There are three established variants of how to specify the locations of the derivatives in the feature space [16].

Average marginal effects (AME): The AME of $x_S$ on the target corresponds to the average of all MEs at the observed values of the feature space. It is a scalar value that may be interpreted as a global feature effect estimate. Note that aggregating MEs entails a loss of information when the MEs have a large variance, i.e. when the prediction function has a strong curvature.

Marginal effects at the mean (MEM): For MEMs, all feature values in $x_C$ are substituted by their sampling distribution means before estimating MEs. The effects may be averaged; when averaged, the MEM is a variation of the AME on a modified dataset. MEMs can be misleading, as the multi-dimensional sample mean of $x_C$ may not be observed or even observable, especially for dummy variables [2].

Marginal effects at representative values (MER): MERs evaluate the prediction function on a modified dataset in a similar way to MEMs, the only difference being that the modified values are specified manually and do not depend on the sampling distribution. MERs can be considered conditional MEs. This is especially useful for evaluating counterfactuals. Representative values should be chosen inside the training data space, because otherwise the fitted model might extrapolate.
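A minimal sketch of the AME and MEM variants (our own illustration; the symmetric FD, step size and names are assumptions):

```python
import numpy as np

def marginal_effects(predict, X, j, h=1e-4):
    """Local MEs of feature j: symmetric FD of predictions divided by 2h."""
    X_fwd, X_bwd = X.copy(), X.copy()
    X_fwd[:, j] += h  # forward step (intervention)
    X_bwd[:, j] -= h  # backward step (intervention)
    return (predict(X_fwd) - predict(X_bwd)) / (2 * h)

def ame(predict, X, j, h=1e-4):
    """Average marginal effect: aggregate local MEs at the observed data."""
    return marginal_effects(predict, X, j, h).mean()

def mem(predict, X, j, h=1e-4):
    """ME at the mean: a local ME evaluated at the sample mean of the features."""
    return ame(predict, X.mean(axis=0, keepdims=True), j, h)
```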

Shapley value: Originating in coalitional game theory [23], the Shapley value is a local feature effect measure that is based on a set of desirable axioms. In coalitional games, a set of p players, denoted by P, play games and join coalitions. They are rewarded with a payout. The characteristic function $v: 2^P \to \mathbb{R}$ maps all player coalitions to their respective payouts [4]. The Shapley value is a player's average contribution to the payout, i.e. the marginal increase in payout for the coalition of players, averaged over all possible coalitions. For Shapley values as feature effects, predicting the target for a single observation corresponds to the game, and a coalition of features represents the players. Shapley regression values were first developed for linear models with multicollinear features [17]. A model agnostic Shapley value was first introduced in [23]. Consider the expected prediction for an observed vector of feature values x, conditional on only observing the feature values $x_K$ of a subset K, i.e. the features in the complement of K are marginalized out. This essentially equals a point (or a line, surface etc., depending on the cardinality of K) on the PD from eq. (1):

$$\hat{f}_K(x_K) = \mathbb{E}_{X_{P \setminus K}}\left[\hat{f}\left(x_K, X_{P \setminus K}\right)\right] \quad (4)$$

Eq. (4) is shifted by the mean prediction and used as a payout function $v(x_K)$, such that an empty set of features ($K = \emptyset$) results in a payout of zero [4]:

$$v(x_K) = \hat{f}_K(x_K) - \mathbb{E}_X\left[\hat{f}(X)\right] \quad (5)$$

From eq. (5), it follows that the marginal contribution of the feature values $x_S$ joining the coalition of values $x_K$ is:

$$v\left(x_{K \cup S}\right) - v\left(x_K\right) = \hat{f}_{K \cup S}\left(x_{K \cup S}\right) - \hat{f}_K\left(x_K\right) \quad (6)$$

The exact notation of the Shapley value for a single feature S with observed values x is:

$$\hat{\phi}_S(x) = \sum_{K \subseteq P \setminus S} \frac{|K|! \, (p - |K| - 1)!}{p!} \left[v\left(x_{K \cup S}\right) - v\left(x_K\right)\right]$$

Shapley values are computationally expensive because the number of coalitions grows exponentially with the number of features, and every payout evaluation requires estimating a PD. Computations can be sped up by Monte Carlo sampling [23] or by taking advantage of possible graph structures in the data [5]. Furthermore, [18] proposes a distinct variant to compute Shapley values called SHapley Additive exPlanations (SHAP).
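A minimal sketch of the Monte Carlo approximation via sampled feature sequences (our own illustration of the permutation sampling idea; all names are assumptions):

```python
import numpy as np

def shapley_value(predict, X, x, j, n_iter=100, seed=None):
    """Monte Carlo Shapley estimate for feature j of a single observation x:
    average marginal contribution over randomly sampled feature sequences."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    contributions = np.empty(n_iter)
    for m in range(n_iter):
        perm = rng.permutation(p)        # random feature sequence
        pos = int(np.argmax(perm == j))  # position of feature j in the sequence
        background = X[rng.integers(n)]  # observation used to marginalize out C
        x_with, x_without = background.copy(), background.copy()
        # features up to (and including) j take their values from x
        x_with[perm[:pos + 1]] = x[perm[:pos + 1]]  # coalition with feature j
        x_without[perm[:pos]] = x[perm[:pos]]       # coalition without feature j
        contributions[m] = (predict(x_with[None, :])
                            - predict(x_without[None, :]))[0]
    return contributions.mean()  # aggregation: average marginal contribution
```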

4 Generalized Framework

All demonstrated techniques are based on the same principles. Instead of trying to inspect the inner workings of a non-linear black box model, we are evaluating its predictions when changing inputs. The SIPA work stages (Sampling, Intervention, Prediction, Aggregation) serve as a framework for thinking about model agnostic techniques. The software package iml [20] contains implementations of all demonstrated techniques and was inspired by the SIPA framework.

Often, the amount of available data is large and the utilized algorithms are computationally expensive. We first sample a subset (sampling stage) to reduce computational costs, e.g. selecting a random set of available observations to evaluate as ICE curves. In order to change the predictions made by the black box model, the data has to be manipulated. Feature values can be set to values from the observed marginal distributions (ICE & PD curves or Shapley values) or to unobserved values (FD based methods such as MEs and ALEs). This crucial step is called the intervention stage. During the prediction stage, we predict on the previously intervened data. This requires an already fitted model, which is why model agnostic techniques are always post hoc. The predictions are then aggregated during the aggregation stage. Often, the predictions resulting from the prediction stage are local effect estimates, and the ones resulting from the aggregation stage are global effect estimates. Optionally, the results can be visualized. In fig. 1, we show how all demonstrated techniques for feature effects are based on the SIPA framework. For every method, we optionally sample a random subset of observations from the test data (sampling stage) and then proceed to the intervention stage.

Figure 1: All demonstrated techniques for feature effects are based on the SIPA framework. Note that one point of each ICE curve corresponds to observed feature values. For reasons of simplicity, we refer to all points on ICE curves as intervened data.

ICE & PD: For each observation, we create copies of the observed vector $x_C^{(i)}$ and combine them with permuted values of $x_S$ (intervention stage). We predict with each permutation (prediction stage). The predictions for one observation correspond to a single ICE curve. All estimated ICE curves may then be averaged point-wise to the PD (aggregation stage).

ALE: We first partition the value range of $x_S$ into a set of intervals and locate the subset of observations inside each interval. We iterate over every interval. For every observation inside an interval, we set its value of $x_S$ to the right and to the left interval boundary (intervention stage). We predict at both boundaries (prediction stage). All FDs inside an interval are averaged to an interval-wise average FD. We integrate the partial derivative to the ALE by summing up the interval-wise average FDs (aggregation stage).

MEs: For MEMs, we first set all feature values in $x_C$ for all observations to the sampling distribution means. For MERs, we set some feature values in $x_C$ for some observations to manually specified values. For AMEs, we use the observed data. In all variants, we construct FDs for each observation, i.e. we take a forward and a backward step of size h (intervention stage). We predict at both interval boundaries (prediction stage). In the aggregation stage, we compute the FD of predictions and divide it by the interval width in order to get ME estimates. For AMEs, and optionally for MEMs, the ME estimates can be averaged.

Shapley value: We first construct all possible feature sequences without feature S. For each sequence, we iterate over all observations and construct permutations for ICE curves, once with and once without feature S (intervention stage). We predict for all permutations (prediction stage). In the aggregation stage, we average the ICE curves of both groups to their PD functions. Next, we plug in the observed vector x and take the difference of PD values, i.e. the marginal contribution to the PD of feature S being included in the sequence. Lastly, we sum up the weighted marginal contributions to the local Shapley value (aggregation stage).
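The common structure of these recipes can be expressed as a generic pipeline (a sketch of our own; the stage callables and the PD usage example are assumptions, not an API of the iml package):

```python
import numpy as np

def sipa(data, sample, intervene, predict, aggregate):
    """The four SIPA work stages as interchangeable callables."""
    X = sample(data)         # Sampling: reduce computational cost
    X_int = intervene(X)     # Intervention: manipulate feature values
    preds = predict(X_int)   # Prediction: evaluate the fitted black box
    return aggregate(preds)  # Aggregation: e.g. average to a global estimate

def set_feature(X, j, value):
    """Intervention helper: substitute column j by a fixed value."""
    X = X.copy()
    X[:, j] = value
    return X

# A single PD point for feature j at value v, expressed as SIPA stages:
# pd_point = sipa(X_test,
#                 sample=lambda X: X[:100],
#                 intervene=lambda X: set_feature(X, j, v),
#                 predict=model.predict,
#                 aggregate=np.mean)
```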

5 Feature Importance

Methods to determine feature effects (direction and magnitude) are tightly intertwined with those determining the importance (contribution to the predictive performance) of a feature. Choosing an appropriate set of features plays an important role in the model building process (feature selection). We are instead interested in interpreting a single, already trained model with a fixed set of features.

In linear models, the quotient of the absolute coefficient and the corresponding standard error serves as a natural feature importance metric. This corresponds to the absolute value of the t-statistic [11]. In the case of black box models, we know neither the analytical form of the coefficients nor their standard errors. We categorize importance measures into two groups: variance-based and performance-based.

Variance-based: A mostly flat trajectory of a single ICE curve implies that, in the underlying predictive model, varying $x_S$ does not affect the prediction for this specific observation. If all ICE curves are shaped similarly, the PD can be used instead. [11] proposes a measure for the curvature of the PD as a feature importance metric. Let the average value of the PD of feature S be denoted by $\overline{PD}_S$. The importance $\hat{I}_S$ of feature S corresponds to the estimated standard deviation of the feature's PD function:

$$\hat{I}_S = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} \left(\widehat{PD}_S\left(x_S^{(i)}\right) - \overline{PD}_S\right)^2}$$

The flatter the PD, the smaller its standard deviation and therefore the feature importance. For categorical features, the range of the PD is divided by 4. This is supposed to represent an approximation to the estimate of the standard deviation for small to medium sized samples [11].

[24] propose the feature importance ranking measure (FIRM). They define a conditional expected score (CES) function for feature S:

$$CES_S(x_S) = \mathbb{E}\left[\hat{f}(X) \mid X_S = x_S\right] \quad (7)$$

It turns out that eq. (7) is equivalent to eq. (1), conditional on $X_S = x_S$. It follows that the CES is a conditional PD.

The FIRM corresponds to the standard deviation of the CES function with all observed values of $x_S$ used as conditional values. This in turn is equivalent to the standard deviation of the PD curve and therefore to the feature importance metric proposed by [11].
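A minimal sketch of the variance-based measure (our own illustration; the evaluation grid and names are assumptions):

```python
import numpy as np

def pd_importance(predict, X, j, grid=None):
    """Variance-based importance in the spirit of [11]: the standard deviation
    of the estimated PD curve of feature j (a flat PD yields low importance)."""
    if grid is None:
        grid = np.quantile(X[:, j], np.linspace(0.05, 0.95, 20))
    pd_curve = np.empty(len(grid))
    for g, value in enumerate(grid):
        X_int = X.copy()
        X_int[:, j] = value                  # intervention: set x_S to grid value
        pd_curve[g] = predict(X_int).mean()  # one PD point via Monte Carlo
    return pd_curve.std(ddof=1)              # aggregation: standard deviation
```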

Performance-based: The permutation feature importance (PFI), originally developed by [3] as a model specific tool for random forests, was described as a model agnostic one by [7]. If feature values are shuffled in isolation, the relationship between the feature and the target is broken up. If the feature was important for the predictive performance, the shuffling should result in an increased loss [4]. The model agnostic PFI of a feature S measures the difference between the GE on data with permuted and non-permuted values of $x_S$. Permuting $x_S$ corresponds to drawing from a new random variable $\tilde{X}_S$ that is distributed like $X_S$ but independent of $X_C$ [4].

Consider the sample of test data $\tilde{\mathcal{D}}_{test}$ where the columns of S have been permuted, and the non-permuted sample $\mathcal{D}_{test}$. The PFI estimate is given by the difference between the GE estimates with permuted and non-permuted data:

$$\widehat{PFI}_S = \widehat{GE}\left(\hat{f}, \tilde{\mathcal{D}}_{test}\right) - \widehat{GE}\left(\hat{f}, \mathcal{D}_{test}\right) \quad (8)$$
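A minimal sketch (our own illustration; `loss` is any aggregate loss with the signature loss(y_true, y_pred), e.g. a mean squared error):

```python
import numpy as np

def pfi(predict, X, y, j, loss, seed=None):
    """Model agnostic PFI: difference between the estimated GE with
    permuted and with non-permuted values of feature j, as in eq. (8)."""
    rng = np.random.default_rng(seed)
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the feature-target link
    return loss(y, predict(X_perm)) - loss(y, predict(X))
```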

Extending the idea of measuring differences in performance, [4] propose individual conditional importance (ICI) and partial importance (PI) curves as visualization techniques that disaggregate the global PFI estimate. They are based on the same principles as ICE and PD. The ICI visualizes the influence of a feature on the predictive performance for a single observation, while the PI visualizes the average influence of a feature for all observations. Consider the prediction for the i-th observation with observed values $(x_S^{(i)}, x_C^{(i)})$ and the prediction where $x_S^{(i)}$ is replaced by a value $\tilde{x}_S$ from the marginal distribution of observed values. The change in loss is given by:

$$\Delta L^{(i)}\left(\tilde{x}_S\right) = L\left(\hat{f}\left(\tilde{x}_S, x_C^{(i)}\right), y^{(i)}\right) - L\left(\hat{f}\left(x_S^{(i)}, x_C^{(i)}\right), y^{(i)}\right)$$

The ICI curve of the i-th observation plots the value pairs $(\tilde{x}_S, \Delta L^{(i)}(\tilde{x}_S))$ for all values of $\tilde{x}_S$. The PI curve is the point-wise average of all ICI curves at all values of $\tilde{x}_S$. Estimating the ICI is nearly identical to estimating the ICE, with only two differences. First, during the prediction stage we predict not only on the intervened data but on both intervened and non-intervened data. Second, in the aggregation stage, we use the losses instead of the raw predictions and compute the difference between both losses. Substituting values of $x_S$ essentially resembles shuffling them. The authors demonstrate how integrating the PI curve with respect to $\tilde{x}_S$ results in an estimate of the global PFI.
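A minimal sketch (our own illustration; here `loss` must be an observation-wise, vectorized loss such as the squared error):

```python
import numpy as np

def ici_curves(predict, X, y, j, loss=lambda y, yh: (y - yh) ** 2):
    """ICI curves: change in observation-wise loss when x_j is replaced by
    each observed value of feature j; the point-wise mean is the PI curve."""
    grid = np.sort(X[:, j])
    base = loss(y, predict(X))  # per-observation loss on non-intervened data
    ici = np.empty((X.shape[0], len(grid)))
    for g, value in enumerate(grid):
        X_int = X.copy()
        X_int[:, j] = value     # intervention: substitute x_j
        ici[:, g] = loss(y, predict(X_int)) - base
    pi = ici.mean(axis=0)       # aggregation: point-wise average -> PI
    return grid, ici, pi
```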

Furthermore, [4] propose a feature importance measure called the Shapley feature importance (SFIMP). Shapley importance values were first introduced by [6] for feature selection. This requires refitting the model with distinct sets of features, which changes the behavior of the learning algorithm and is not suited to evaluating a single model, as noted by [4]. The SFIMP is based on the same principle as deriving the ICI from ICE curves, i.e. it uses the same computations as the Shapley value but replaces the payout function with one that is sensitive to the model performance. The authors define a new payout that substitutes the estimated PD with the estimated generalization error (GE). The marginal contribution of a feature then equals its expected PFI from eq. (8). We can therefore regard the SFIMP as an extension of the PFI [4].

6 Extending the Framework to the Feature Importance

It may not come as a surprise that feature importance measures are also based on the SIPA framework. Variance-based methods simply measure the variance of some kind of feature effect estimate, which we already demonstrated to be based on the SIPA framework. Performance-based techniques measure some form of loss, which implies two possible modifications. First, we predict on non-intervened as well as intervened data (prediction stage). Second, we aggregate the predictions to a loss (aggregation stage).

In fig. 2, we demonstrate how feature importance computations are based on the same work stages as feature effect computations.

Figure 2: Extending the framework to feature importance. Model agnostic importance measures are based on the same work stages as methods for feature effects. Variance-based measures evaluate the variance of effect estimates. Performance-based measures evaluate the loss function.

7 Conclusion

Model agnostic interpretation techniques have recently garnered widespread attention. The lack of interpretability of most ML models has proved to be an obstacle to their adoption, e.g. in high-stakes decision making. Modularity is a central motive for using ML, i.e. the ability to algorithmically train and substitute models according to certain performance measurements. Model agnostic interpretation methods therefore represent a major stepping stone in advancing the field, because they allow us to substitute interpretation techniques as well. A variety of model agnostic methods has been developed, but different terminologies have complicated the understanding of these methods and of how they relate to each other. By deconstructing them into sequential work stages, one discovers striking similarities in their methodologies. With our contributions, we hope to work towards a unified view on model agnostic techniques and to facilitate their understanding and discussion.

In our first contribution, we formally added marginal effects from statistics to the range of black box interpretation techniques in ML. Our main contribution was to present the generalized SIPA framework of work stages for model agnostic interpretations. Essentially, all model agnostic methods are based on four work stages. First, there is a sampling stage to reduce computational costs. Second, we intervene in the data in order to change the behavior of the black box model. Third, we predict on intervened or non-intervened data. Fourth, we aggregate predictions. Optionally, we can visualize the results. We embedded multiple methods to estimate the effect (ICE & PD, ALEs, MEs and Shapley values) and importance (FIRM, PFI, ICI & PI and the SFIMP) of features into the framework.

Realizing that model agnostic interpretation techniques essentially operate according to the same principle may provide researchers and practitioners of ML with more clarity about the workings of the methods they are using. The SIPA framework may therefore serve as a guideline to conduct model agnostic interpretations and inspire the development of novel methods.

Acknowledgments

This work is supported by the Bavarian State Ministry of Science and the Arts as part of the Centre Digitisation.Bavaria (ZD.B) and by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A. The authors of this work take full responsibility for its content.

References