Model-agnostic Feature Importance and Effects with Dependent Features – A Conditional Subgroup Approach

by   Christoph Molnar, et al.
Universität München

Partial dependence plots and permutation feature importance are popular model-agnostic interpretation methods. Both methods are based on predicting artificially created data points. When features are dependent, both methods extrapolate to feature areas with low data density. The extrapolation can cause misleading interpretations. To overcome extrapolation, we propose conditional variants of partial dependence plots and permutation feature importance. Our approach is based on perturbations in subgroups. The subgroups partition the feature space to make the feature distribution within a group more homogeneous and between the groups more heterogeneous. The interpretable subgroups enable additional local, nuanced interpretations of the feature dependence structure as well as the feature effects and importance values within the subgroups. We also introduce a data fidelity measure that captures the degree of extrapolation when data is transformed with a certain perturbation. In simulations and benchmarks on real data we show that our conditional interpretation methods reduce extrapolation. In an application we show that these methods provide more nuanced and richer explanations.




1 Introduction

Many machine learning (ML) interpretation methods (see [18, 10] for an overview) are based on making predictions on perturbed input features, e.g., by permuting feature values. The partial dependence plot (PDP) [9] and permutation feature importance (PFI) [8] perturb individual features without conditioning on the remaining features, i.e., feature values are changed while ignoring the joint distribution. If features are dependent, such perturbations will cause predictions that extrapolate to areas of the feature space with low density. Extrapolation can result in misleading interpretations [12] (see Fig. 1).

Figure 1: Simulation of features and and prediction model . Left: Scatter plot with 100 data points and the prediction surface of . Right: PDP of . The dotted line marks the largest observed prediction in the data. Conclusion: The PDP is misleading since it indicates that the model predicts values above 3 (on average) for , while no realistic data point would produce the respective prediction. The extrapolation is caused by unconditional perturbation. For the PDP at all observed are considered, ignoring that only are realizable for this distribution.

An obvious approach to avoid extrapolation would be to perturb a feature conditional on all other features and thereby preserve the joint distribution. The interpretation of conditional feature effect and importance differ from the unconditional variants (see Fig. 2). The conditional effect of a feature is a mixture of its unconditional effect and the unconditional effects of all dependent features. The conditional PFI must be interpreted as the importance of a feature given the other features. If two features are highly dependent, their conditional importance is lower than their unconditional importance, because their shared information can be substituted by the other feature. For global interpretation methods, there is a trade-off between avoiding extrapolation and unconditional interpretation of feature effects and importance.

Figure 2: Simulation of a linear model with and a correlation of 0.978 between and . Left: PDP and M-Plot (conditional PDP variant) for feature . The M-Plot mixes the effects of and and thus shows a positive effect. Right: The PFI of decreases when is permuted conditional on and vice versa. Feature is conditionally less important than although both have the same coefficient in the linear model.

1.0.1 Contributions:

We propose novel, model-agnostic variants of the conditional PDP and conditional PFI based on interpretable subgroups. Our approach is based on constructing subgroups in which the feature of interest is independent from other features and values within the groups are permuted. Subgroup permutation greatly reduces extrapolation while maintaining unconditional interpretation within the subgroups. Furthermore, we introduce a data fidelity measure that quantifies the ability of an interpretation method to preserve the data distribution. Using simulated and real data, we show that conditional subgroup permutation achieves state-of-the-art data fidelity. We compare our conditional subgroup PFI with the true cPFI in a simulation and demonstrate state-of-the-art performance. In an application, we illustrate how our conditional PDP and PFI can reveal new insights into the ML model and the data.

2 Notation and Background

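We consider ML prediction functions $\hat{f}: \mathbb{R}^p \to \mathbb{R}$, where $\hat{f}(x)$ is a model prediction and $x \in \mathbb{R}^p$ is a $p$-dimensional feature vector. We define $x_j \in \mathbb{R}^n$ as an observed single feature (vector) and use $X_j$ to refer to the $j$-th feature as a random variable. With $x_{-j}$ we refer to the complementary feature space $x_{\{1, \ldots, p\} \setminus \{j\}}$ and with $X_{-j}$ to the corresponding random variables. We refer to the value of the $j$-th feature from the $i$-th instance as $x_j^{(i)}$ and to the tuples $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ as data.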

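Permutation Feature Importance (PFI) for a feature $x_j$ is estimated as the average increase in prediction loss when the feature is permuted in training or test data:

$$PFI_j = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{1}{M} \sum_{m=1}^{M} \tilde{L}^{m(i)} - L^{(i)} \right) \tag{1}$$

where $L^{(i)} = L(y^{(i)}, \hat{f}(x^{(i)}))$ is the loss for the $i$-th observation, $\tilde{L}^{m(i)} = L(y^{(i)}, \hat{f}(\tilde{x}_j^{m(i)}, x_{-j}^{(i)}))$ is the loss after replacing $x_j^{(i)}$ with its permuted value, $\tilde{x}_j^{m}$ is a permutation of $x_j$ and $M$ the number of repeated permutations. Numerous variations of this formulation exist. Breiman [3] proposed the PFI for random forests, which is computed from the out-of-bag samples of individual trees. Subsequently, Fisher et al. [Fisher2018] introduced a model-agnostic PFI version.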
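The PFI estimate described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation (their experiments used R and mlr): the function name `permutation_importance`, the toy model and the toy data are all hypothetical.

```python
import random

def permutation_importance(predict, X, y, j, loss, n_perm=10, seed=0):
    """Mean loss increase when column j is permuted, averaged over n_perm permutations."""
    rng = random.Random(seed)
    n = len(X)
    base = [loss(y[i], predict(X[i])) for i in range(n)]  # original losses
    increase = 0.0
    for _ in range(n_perm):
        column = [row[j] for row in X]
        rng.shuffle(column)  # unconditional (marginal) permutation of x_j
        for i in range(n):
            x_tilde = list(X[i])
            x_tilde[j] = column[i]
            increase += loss(y[i], predict(x_tilde)) - base[i]
    return increase / (n * n_perm)

def squared_error(y_true, y_pred):
    return (y_true - y_pred) ** 2

# Hypothetical toy model that only uses feature 0.
model = lambda x: 2.0 * x[0]
X = [[float(i), float(i % 3)] for i in range(20)]
y = [model(x) for x in X]
pfi_used = permutation_importance(model, X, y, 0, squared_error)
pfi_unused = permutation_importance(model, X, y, 1, squared_error)
```

In this toy setting the feature the model ignores receives an importance of exactly zero, while the feature it uses receives a positive importance.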

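The Partial Dependence Plot (PDP) [9] describes the average effect of the $j$-th feature on the prediction. The PDP evaluated at feature value $x$ is $PDP_j(x) = E_{X_{-j}}[\hat{f}(x, X_{-j})]$, which is estimated with Monte Carlo integration:

$$\widehat{PDP}_j(x) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}(x, x_{-j}^{(i)}) \tag{2}$$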
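As a rough illustration of the Monte Carlo estimate of the PDP, the following sketch sets feature $j$ to a grid value for every observation and averages the predictions. The function name `partial_dependence`, the toy model and the toy data are assumptions made for this example.

```python
def partial_dependence(predict, X, j, grid):
    """For each grid value, average the predictions with x_j set to that value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in X:
            x_mod = list(row)
            x_mod[j] = value  # replace x_j, keep the remaining features as observed
            total += predict(x_mod)
        curve.append(total / len(X))
    return curve

# Hypothetical additive toy model and data.
model = lambda x: x[0] + 2.0 * x[1]
X = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
curve = partial_dependence(model, X, 0, [0.0, 1.0])
```

Note that the loop combines each grid value with every observed value of the other features, including combinations such as (0.0, 3.0) that never occur in the data. With dependent features, this is where extrapolation enters.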


3 Related Work

Conditional PDP. The marginal plot (M-Plot) [2] averages the predictions locally on the feature grid and mixes effects of dependent features (see Fig. 2).
Hooker (2007) [11] proposed a functional ANOVA decomposition with hierarchically orthogonal components. The decomposition requires access to the joint distribution of the data. The approach has the undesirable property that in e.g. a linear model the coefficients are not recovered when features are correlated. Accumulated Local Effect (ALE) plots by Apley and Zhu [2] reduce extrapolation by accumulating the finite differences computed within intervals of the feature of interest. Interpretations of ALE plots are, by definition, only locally valid. Furthermore, there is no satisfactory approach to derive ALE plots for categorical features, since ALE requires ordered feature values. Our proposed approach can handle categorical features.
Another PDP variant based on stratification was proposed by [19]. However, this stratified PDP describes only the data and is independent of the model.

Conditional PFI. Strobl et al. [24] proposed a conditional PFI for the random forest. While [24] relies on the splits of the underlying random forest trees and permutes the features within these subgroups, we construct the subgroups explicitly from the conditional distribution of the features in a model-agnostic way.
Hooker and Mentch [12] suggested four methods for conditional feature importance: Conditional Variable Importance, Dropped Variable Importance, Permute-and-Relearn Importance and Condition-and-Relearn Importance. Our proposed method is a variant of the Conditional Variable Importance measure based on interpretable subgroups.
Knockoffs are random variables which are "copies" of the original features that preserve the joint distribution but are otherwise independent of the prediction target. Knockoffs can be used to replace feature values for conditional feature importance computation. Candes et al. [4] proposed knockoffs based on the correlation structure of the features. Others have proposed to use generative adversarial networks for generating knockoffs [22]. Knockoffs are not transparent with respect to how they condition on the features, while we report interpretable subgroups.

4 Conditional Subgroups

PFI and PDPs are based on sampling from marginal feature distributions, which causes extrapolation when features are dependent [12]. Conditional variants of PFI and PDPs (see Section 3) avoid extrapolation by sampling from distributions conditional on the remaining features. However, with conditional PFI and conditional PDP the data dependencies between $X_j$ and $X_{-j}$ can influence the sample, leading to an interpretation that mixes properties of the model with properties of the dataset [15].
We suggest approaching the dependent feature problem by constructing an interpretable grouping $G_j$ such that the feature of interest $X_j$ is independent of the remaining features $X_{-j}$ within each subgroup, i.e. $X_j \perp X_{-j} \mid G_j$. Sampling from the group-wise marginal distribution reduces extrapolation (Fig. 3).

Figure 3: Suppose feature and , if , otherwise (black dots). Top left: The cross-shaped points represent data created by permuting . These points are further away from the black dots if , which causes extrapolation. Bottom left: Marginal density of . Top right: Permuting within subgroups based on ( and ) reduces extrapolation. Bottom right: Densities of conditional on the subgroups.

Within a group, samples from the marginal and the conditional distribution coincide. The grouping consequently enables (1) the application of standard PFI and PDPs within each group without extrapolation and (2) sampling from the global conditional distribution and using group-wise permutation. With our approach we exploit these properties to derive both (1) group-wise unconditional and (2) global conditional interpretations. The group-wise unconditional PFIs and PDPs can be seen as a decomposition of the global conditional interpretation.
To get a good approximation of the marginal distribution in a group, the group should contain sufficient observations. Moreover, the groupings should be human-intelligible. Existing approaches that model the conditional distribution for interpretation [4, 24, 1] do not provide such a coarse, explicit interpretable grouping.
Transformation trees: We use transformation trees [14] to model the conditional distribution $P(X_j \mid X_{-j})$ of the feature of interest $X_j$ given features $X_{-j}$. This approach partitions the feature space so that the distribution of $X_j$ within the resulting subgroups is homogeneous, i.e. the group-wise parameterization of the modeled distribution is independent of $X_{-j}$. By specifying a maximum tree depth or the minimum number of observations within a node, the granularity of the partitioning can be traded off against the homogeneity of the distributions within a partition. Each partition can be described by the conditions that determine its boundaries, e.g. in the form of the decision path. We leverage this partitioning to construct an interpretable grouping $G_j$. The new variable can be calculated by assigning every observation the indicator of the partition that it lies in (meaning that for $x^{(i)}$ with $x_{-j}^{(i)} \in \mathcal{G}_j^k$ the group variable's value is defined as $g_j^{(i)} := k$).
In theory, the approach faces two challenges. First, not every distribution can be perfectly partitioned into homogeneous and interpretable parts (e.g. in the case of linear Gaussian dependencies). However, the granularity of the grouping can be adjusted using the model's hyperparameters. As empirical results show, the method's performance is equal to or better than existing approaches in ground-truth evaluations (Section 6.3). Second, the distribution we specify for the model needs to be able to capture the dependencies between $X_j$ and $X_{-j}$ for $X_j \perp X_{-j} \mid G_j$ to hold. However, the approach is in principle agnostic to the specified distribution, and the default transformation family of distributions is very general, as empirical results suggest [14]. In most settings, it is therefore reasonable to assume $X_j \perp X_{-j} \mid G_j$. For more detailed explanations of transformation trees please refer to [14].

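For the remainder of this paper, we have set the minimum number of observations in a node to 30, used Bernstein polynomials of degree five for the transformation function and the Normal distribution as target distribution. We denote the subgroups by $\mathcal{G}_j^k \subset \mathbb{R}^{p-1}$, where $k \in \{1, \ldots, K_j\}$ is the $k$-th subgroup for feature $j$, with $K_j$ groups in total for the $j$-th feature. The subgroups are disjoint: $\mathcal{G}_j^l \cap \mathcal{G}_j^k = \emptyset$ for all $l \neq k$, and $\bigcup_{k=1}^{K_j} \mathcal{G}_j^k = \mathbb{R}^{p-1}$. Let $D_j^k$ be a subset of the data $D$ that refers to the data subset belonging to the subgroup $\mathcal{G}_j^k$, and let $n_k$ be its number of observations.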
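To make the idea of an explicit grouping concrete, the sketch below finds a single split on one conditioning feature. It is deliberately not a transformation tree: as a simplified stand-in, it uses one CART-style split that only reduces the within-group variance of $x_j$, whereas transformation trees model the whole conditional distribution. The function names, the `min_size` parameter (loosely analogous to the minimum node size) and the toy data are assumptions for illustration.

```python
def best_split(x_j, z, min_size=2):
    """Threshold on z that minimizes the within-group sum of squares of x_j."""
    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    pairs = sorted(zip(z, x_j))
    best_threshold, best_cost = None, float("inf")
    for cut in range(min_size, len(pairs) - min_size + 1):
        if pairs[cut - 1][0] == pairs[cut][0]:
            continue  # cannot split between tied values of z
        left = [value for _, value in pairs[:cut]]
        right = [value for _, value in pairs[cut:]]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_cost = cost
            best_threshold = (pairs[cut - 1][0] + pairs[cut][0]) / 2
    return best_threshold

def assign_groups(z, threshold):
    """Group label per observation: 0 if z < threshold, otherwise 1."""
    return [0 if value < threshold else 1 for value in z]

# Hypothetical toy data: x_j is close to 0 when z < 5 and close to 4 otherwise.
z = [float(i) for i in range(10)]
x_j = [0.1, -0.2, 0.0, 0.2, -0.1, 4.1, 3.9, 4.0, 4.2, 3.8]
threshold = best_split(x_j, z)
groups = assign_groups(z, threshold)
```

The resulting rule ("z below or above the threshold") is the kind of human-readable subgroup description that the decision path of a tree provides.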

4.1 Conditional Permutation Feature Importance

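We estimate the PFI of feature $x_j$ within a subgroup $\mathcal{G}_j^k$ as $PFI_j^k = \frac{1}{n_k} \sum_{i: x_{-j}^{(i)} \in \mathcal{G}_j^k} \left( \frac{1}{M} \sum_{m=1}^{M} \tilde{L}^{m(i)} - L^{(i)} \right)$, where $\tilde{L}^{m(i)} = L(y^{(i)}, \hat{f}(\tilde{x}_j^{m(i)}, x_{-j}^{(i)}))$, $\tilde{x}_j^{m(i)}$ refers to the $m$-th permutation of $x_j$ within the subgroup $\mathcal{G}_j^k$ and $n_k$ is the number of observations in the subgroup. Algorithm 1 describes the cPFI estimation for one feature in detail on unseen data.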

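Algorithm 1: Conditional Permutation Feature Importance
Input: Model $\hat{f}$, data $D_{train}$, $D_{test}$, loss $L$, feature $j$, no. permutations $M$
Compute subgroups $\mathcal{G}_j^k$, $k \in \{1, \ldots, K_j\}$, on $D_{train}$
for $k \in \{1, \ldots, K_j\}$ do
      Select subset $D_{test}^k$ of $D_{test}$ that belongs to $\mathcal{G}_j^k$
      Compute error $L^{(i)}$ for each observation in $D_{test}^k$
      for $m \in \{1, \ldots, M\}$ do
            Generate $\tilde{x}_j^{m}$ by permuting feature $x_j$ within $D_{test}^k$
            Estimate error vector $\tilde{L}^{m}$
      Compute subgroup importance $PFI_j^k$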

The algorithm has two outcomes: We get importance values for feature $x_j$ for each subgroup ($PFI_j^k$) and a global conditional feature importance ($cPFI_j$). The latter is equivalent to the average of the subgroup importances, weighted by the number of observations within each subgroup (Appendix 0.A).

The cPFI needs the same number of model evaluations as the PFI ($O(n \cdot M)$).
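The steps of Algorithm 1 can be sketched as follows, assuming the subgroup labels have already been computed (for example by a tree on the training data). This is an illustrative sketch only; `conditional_pfi`, its arguments and the toy data are hypothetical names, not the authors' code.

```python
import random

def conditional_pfi(predict, X, y, j, groups, loss, n_perm=10, seed=0):
    """Permute x_j within each subgroup; return per-group PFIs and their size-weighted mean."""
    rng = random.Random(seed)
    n = len(X)
    base = [loss(y[i], predict(X[i])) for i in range(n)]  # original losses
    pfi_by_group = {}
    for g in sorted(set(groups)):
        idx = [i for i in range(n) if groups[i] == g]
        increase = 0.0
        for _ in range(n_perm):
            values = [X[i][j] for i in idx]
            rng.shuffle(values)  # permutation restricted to subgroup g
            for i, value in zip(idx, values):
                x_tilde = list(X[i])
                x_tilde[j] = value
                increase += loss(y[i], predict(x_tilde)) - base[i]
        pfi_by_group[g] = increase / (len(idx) * n_perm)
    # Global cPFI: subgroup importances weighted by subgroup size.
    cpfi = sum(pfi_by_group[g] * groups.count(g) / n for g in pfi_by_group)
    return pfi_by_group, cpfi

squared_error = lambda y_true, y_pred: (y_true - y_pred) ** 2
model = lambda x: x[0]
# Toy data: feature 0 is fully determined by the group, so it carries
# no information beyond the grouping.
X = [[0.0, 0.0] for _ in range(5)] + [[10.0, 1.0] for _ in range(5)]
y = [model(x) for x in X]
groups = [0] * 5 + [1] * 5
by_group, cpfi = conditional_pfi(model, X, y, 0, groups, squared_error)
```

In this toy case the within-group permutation leaves every prediction unchanged, so the conditional importance is zero, even though an unconditional permutation of the same feature would increase the loss.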

4.2 Conditional Partial Dependence Plot (cPDP)

The conditional PDP has a different interpretation than the unconditional PDP, as the motivating example shows (Fig. 2). The proposed PDP variant solves the problem of extrapolation while allowing an unconditional interpretation within each subgroup. Since the marginal and conditional distributions coincide within groups, we compute the subgroup PDP $PDP_j^k$ for each group using the (unconditional) standard PDP formula in Equation 2. This results in multiple PDPs per feature, which can be displayed together in the same plot as in Fig. 8.
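A minimal sketch of subgroup PDPs, assuming subgroup labels are given: the standard PDP estimate is applied separately to the observations of each subgroup, with the grid limited to the range of $x_j$ observed in that subgroup. The function name `subgroup_pdps`, the equidistant grid and the toy data are assumptions for this example.

```python
def subgroup_pdps(predict, X, j, groups, n_grid=3):
    """Standard PDP per subgroup, on a grid spanning the subgroup's observed range of x_j."""
    curves = {}
    for g in sorted(set(groups)):
        rows = [row for row, label in zip(X, groups) if label == g]
        lo = min(row[j] for row in rows)
        hi = max(row[j] for row in rows)
        grid = [lo + (hi - lo) * t / (n_grid - 1) for t in range(n_grid)]
        curve = []
        for value in grid:
            # Replace x_j by the grid value, keep the other features as observed.
            preds = [predict(row[:j] + [value] + row[j + 1:]) for row in rows]
            curve.append(sum(preds) / len(rows))
        curves[g] = (grid, curve)
    return curves

# Hypothetical toy model and data with two clearly separated subgroups.
model = lambda x: x[0] + x[1]
X = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 5.0], [11.0, 5.0], [12.0, 5.0]]
groups = [0, 0, 0, 1, 1, 1]
curves = subgroup_pdps(model, X, 0, groups)
```

Each subgroup curve only covers feature values that actually occur in that subgroup, so no prediction is made for, say, a feature value of 12 combined with the second feature at 0.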

Again, we do not only get the groupwise result. We can aggregate subgroup PDPs to yield the conditional PDP (cPDP). A proof is given in Appendix 0.B.

We restrict each subgroup PDP to the interval $[\min(x_j^k), \max(x_j^k)]$ of feature values observed in the subgroup. For our visualization, we suggest plotting the PDPs similar to boxplots, where the dense center quartiles are indicated with a bold line (see Fig. 4).

The subgroup PDPs do not break if features are independent.

Theorem 4.1

When feature $X_j$ is independent of features $X_{-j}$, each subgroup PDP has the same expectation as the unconditional PDP, and an $n / n_k$-times larger variance, where $n$ and $n_k$ are the number of observations in the data and the subgroup $\mathcal{G}_j^k$.

The proof is shown in Appendix 0.C. Equivalence in expectation and higher variance under independence of $X_j$ and $X_{-j}$ hold true even if the partitions were chosen randomly.
Assuming we perform $M$ permutations in both settings, both the PDP and the set of subgroup PDPs need $O(n \cdot M)$ evaluations, since $\sum_{k=1}^{K_j} n_k = n$ (and worst case $O(n^2)$ if evaluated at each $x_j^{(i)}$ value).

Figure 4: Left: Normal PDP. Bottom right: Boxplot showing the distribution of feature $x_j$. Top right: PDP using boxplot emphasis. In the $x$-range, the PDP is drawn over a range determined by the $IQR$, where $IQR$ is the range between the 25% and 75% quantile. If this range exceeds the observed minimum or maximum of $x_j$, the PDP is capped. Outliers are drawn as points. The PDP is bold between the 25% and 75% quantiles.

5 Data and Model Fidelity

5.1 Data Fidelity

PDP and PFI work by data perturbation, prediction and subsequent aggregation [23]. We define a measure of data fidelity to quantify the ability to preserve the joint distribution under perturbation.

Definition 1 (Data Fidelity)

Data fidelity is the degree to which a perturbation $\tilde{X}_j$ of feature $X_j$ preserves the joint distribution of $(X_j, X_{-j})$, i.e. the degree to which $(\tilde{X}_j, X_{-j}) \sim (X_j, X_{-j})$.

This definition is similar to a property required for knockoffs, see e.g. [4]. Based on data $D$, perturbations create a new dataset $\tilde{D}_j$ which is to be compared to the original data distribution. In this two-sample test scenario, the maximum mean discrepancy (MMD) can be used to compare whether two samples come from the same distribution. We propose to measure the data fidelity with the empirical MMD:

$$MMD(D, \tilde{D}_j) = \frac{1}{n^2} \sum_{x, z \in D} k(x, z) - \frac{2}{n l} \sum_{x \in D, z \in \tilde{D}_j} k(x, z) + \frac{1}{l^2} \sum_{x, z \in \tilde{D}_j} k(x, z) \tag{3}$$

where $D = \{(x_j^{(i)}, x_{-j}^{(i)})\}_{i=1}^{n}$ is the original dataset and $\tilde{D}_j = \{(\tilde{x}_j^{(i)}, x_{-j}^{(i)})\}_{i=1}^{l}$ a dataset with perturbed $x_j$. As kernel $k$ we used the radial basis function kernel for all experiments. We require the features to be scaled to a mean of zero and a standard deviation of one. Categorical features are one-hot encoded. For parameter $\sigma$ of the radial basis function kernel, we chose the median L2-distance between data points.
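The empirical MMD with a radial basis function kernel can be sketched as below. Two details are assumptions of this sketch, since the text does not spell them out: the kernel is parameterized as $k(x, z) = \exp(-\lVert x - z \rVert^2 / (2 \sigma^2))$, and the median distance is taken over the points of the original dataset. Function names and toy data are hypothetical.

```python
import math

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def median_distance(points):
    """Median pairwise Euclidean (L2) distance, used here as kernel width sigma."""
    d = sorted(math.sqrt(sq_dist(points[i], points[k]))
               for i in range(len(points)) for k in range(i + 1, len(points)))
    mid = len(d) // 2
    return d[mid] if len(d) % 2 else 0.5 * (d[mid - 1] + d[mid])

def mmd(D, D_tilde, sigma):
    """Empirical (biased) squared MMD between two samples with an RBF kernel."""
    def k(a, b):
        return math.exp(-sq_dist(a, b) / (2.0 * sigma ** 2))

    def mean_kernel(A, B):
        return sum(k(a, b) for a in A for b in B) / (len(A) * len(B))

    return mean_kernel(D, D) - 2.0 * mean_kernel(D, D_tilde) + mean_kernel(D_tilde, D_tilde)

# Toy data: two perfectly dependent features, and a perturbed copy in which
# the second feature is reversed, which destroys the dependence.
D = [(-2.0, -2.0), (-1.0, -1.0), (0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
D_perm = [(x1, -x2) for x1, x2 in D]
sigma = median_distance(D)
mmd_same = mmd(D, D, sigma)
mmd_perm = mmd(D, D_perm, sigma)
```

Comparing a dataset with itself gives an MMD of zero, while the perturbed copy gives a positive value, i.e. lower data fidelity.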

5.2 Model Fidelity

Model fidelity has been defined as how close the predictions of an explanation method are to the ML model [21]. Similarly, we define model fidelity for feature effects as the mean squared error between the model prediction and the prediction by the partial function $f_j$ (which depends only on feature $x_j$) defined by the feature effect method. For a given data instance, the predicted outcome of, e.g., a PDP is the y-axis value at the observed $x_j$ value.

$$MF(\hat{f}, f_j) = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{f}(x^{(i)}) - f_j(x_j^{(i)}) \right)^2 \tag{4}$$

where $f_j$ is a feature effect function such as ALE or PDP. In order to evaluate ALE plots, they have to be adjusted such that they are on a comparable scale to a PDP [2], i.e., $f_j^{ALE,adj} = f_j^{ALE} + \frac{1}{n} \sum_{i=1}^{n} \hat{f}(x^{(i)})$.
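The model fidelity measure is a plain mean squared error between the model and a one-feature effect function, which the following sketch illustrates. The function name `model_fidelity`, the toy model and the closed-form PDP used as effect function are assumptions made for this example.

```python
def model_fidelity(predict, effect, X, j):
    """Mean squared difference between model predictions and a one-feature effect function."""
    return sum((predict(row) - effect(row[j])) ** 2 for row in X) / len(X)

# Hypothetical additive toy model and data.
model = lambda x: x[0] + 2.0 * x[1]
X = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]

# For this additive model the PDP of feature 0 has a closed form:
# the feature value plus the average contribution of feature 1.
mean_x1 = sum(row[1] for row in X) / len(X)
pdp_0 = lambda value: value + 2.0 * mean_x1

mf = model_fidelity(model, pdp_0, X, 0)
```

Because the measure is a loss, lower values mean higher model fidelity. It is zero only if the effect function reproduces the model prediction for every observation.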

6 Evaluation

6.1 Data Fidelity Evaluation

We evaluated how different types of perturbations affect the data fidelity measure (based on MMD) for numerous datasets (see Table 1).

                 wine  satellite  wind  space  pollen  quake
No. of rows      6497  6435       6574  3107   3848    2178
No. of features  12    37         15    7      6       4

Table 1: We selected data sets from OpenML [25, 5] having 1000 to 8000 instances and a maximum of 50 numerical features. We excluded data sets with categorical features, since ALE cannot handle them.

We used 40% of the data to fit the transformation trees to find the subgroups. The remaining 60% were split in half. One half remained unchanged, while we perturbed one of the features in the other half. Then, we computed the MMD comparing the two datasets. For the PDP / PFI perturbation, we permuted $x_j$ once, so that each observation receives exactly one replacement value and the perturbed dataset has the same size as the original data. This differs from the PDP definition, where $x_j$ of each observation is replaced by a set of grid values. According to [6], PFI and PDP can be formulated with the same underlying feature replacement strategy, either by replacing the feature values using pre-defined grid points (as usually done in PDP) or by permuting the feature values (as usually done in PFI). For the perturbation in subgroups, we permuted $x_j$ once within each subgroup. For the interval-based perturbation of ALE plots, we used a grid based on 30 quantiles that determine the intervals. We averaged two MMD computations: once when moving each observation to the left border of the containing ALE interval, and once when moving it to the right border. For Model-X knockoffs [4] we replaced the feature with its knockoff. We repeated the experiment 30 times with different random seeds.

Figure 5: MMD for different perturbation types: unconditional permutation (Perm), conditional subgroup permutation (CSx, x=max. depth), Model-X knockoffs (xKO), ALE, and without permutation (BL). Each curve is the MMD for a feature, averaged over 30 samples. Lower is better.

Fig. 5 shows that the PDP/PFI type of perturbation has a low data fidelity (high MMD) compared to all other approaches. Model-X knockoffs and conditional subgroup permutation (with many groups) have the best data fidelity. Even splitting with a maximum depth of only 1 (two subgroups) strongly improves data fidelity. The deeper the trees are, the more subgroups are found and the better the data fidelity. Ranked across all features and datasets, the average rankings show that deep conditional subgroups even outperform ALE, see Table 2.

            BL    xKO   CS10  CS5   ALE   CS2   CS1   Perm
Mean ranks  2.06  2.97  2.99  3.74  4.23  5.61  6.68  7.72

Table 2: Mean ranks based on MMD of various perturbation methods over datasets, features and repetitions. Legend: CSx: Subgroup permutation with maximal depth of x. Perm: Unconditional permutation. ALE: ALE perturbation [2]. xKO: Model-X knockoffs [4]. BL: No intervention.

6.2 Model Fidelity Evaluation

In this section we evaluate the model fidelity of PDP, ALE and subgroup PDPs. We trained random forests (500 trees), linear models and k-nearest neighbours models (k = 7) on various datasets (Table 1). 70% of the data were used to train the models and the transformation trees. 30% of the data were used to evaluate model fidelity. For each model and each dataset, we measured model fidelity between effect prediction and model prediction (Equation 4), averaged across observations and features. Table 3 shows that the model fidelity of ALE and PDP is similar, while the subgroup PDPs have the best model fidelity. This is interesting since the grouping is neither based on the model nor the real target, but solely on the conditional dependence structure of the features.

          pollen  quake  satellite  space  wind   wine
PDP       9.56    0.03   4.77       0.12   43.86  0.73
ALE       9.56    0.03   4.77       0.12   43.86  0.73
cPDP(1)   9.56    0.03   4.43       0.07   28.81  0.70
cPDP(2)   8.08    0.03   3.18       0.05   24.15  0.67
cPDP(5)   7.13    0.03   2.73       0.03   18.94  0.60
cPDP(10)  7.06    0.03   2.34       0.02   17.93  0.59

Table 3: Median model fidelity averaged over features in a random forest for various datasets. The cPDPs never had a higher loss (i.e. never a lower model fidelity) than PDP and ALE, and in most cases a lower one. The loss decreases or stays constant with increasing maximum tree depth for subgroup construction. Using different models (knn or linear model) produced similar results, see Appendix 0.E.


6.3 Conditional Feature Importance Evaluation

We computed the true conditional feature importance for following simulated linear model: , where with . In the simulations, we varied the correlation between and , while remained independent. We repeated the experiments 30 times and sampled 1000 data points in each repetition.

We examined two experimental settings. In setting (I) we assumed that our machine learning model recovered the true linear regression model. We measured the absolute distance between the true cPFI of the feature of interest (derivation in Appendix 0.D) and the cPFI based on subgroups with different tree depths for subgroup generation. In setting (II) we trained a random forest (with 100 conditional inference trees [13], mtry = 2 and maxdepth = 10). For this random forest we computed the cPFI with various methods (computed on the random forest) and compared it to the true cPFI (based on the data generating process). We compared our subgroup cPFI approach, the random forest based cPFI by Strobl et al. [24], and Model-X knockoffs [4].
Fig. 6 shows that (I) the deeper the transformation trees (and the more subgroups), the better the true cPFI is approximated and (II) our subgroup-based cPFI approach is equal or superior to the state of the art.

Figure 6: Left: Experiment (I) comparing the absolute difference of subgroup cPFI from true cPFI for different correlation strengths and different tree depths for subgroup identification. The higher the correlation is, the worse the cPFI computation when using few subgroups. Right: Experiment (II) comparing various cPFI approaches on a random forest against the true cPFI based on the data generating process.

7 Application

In the following practical application, we demonstrate that subgroup-based conditional variants are a valuable tool to understand model and data beyond the insights given by PFI, PDPs or ALE plots.
We trained a random forest to predict daily bike rentals [7] from given weather and seasonal information. The data was divided into 70% training and 30% test data. The features in the bike data are dependent. For example, the correlation between temperature and humidity is 0.13. The data contains both categorical and numerical features and we are interested in the multivariate, non-linear dependencies, thus correlation is an inadequate measure of dependence. We therefore indicate the degree of dependence by showing the extent to which we can predict each feature from all other features in Table 4. Per feature, we trained a random forest to predict the feature from all other features. Random forests can capture non-linear dependencies and interactions and work reasonably well without tuning. We measured the proportion of loss explained to quantify the dependence of the respective feature on all other features. For numerical features we used the R-squared measure. For categorical features we computed $1 - MMCE(y_{class}, rf(X)) / MMCE(y_{class}, x_{mode})$, where $MMCE$ is the mean misclassification error, $y_{class}$ the true class, $rf(\cdot)$ the classification function of the random forest and $x_{mode}$ the most common class in the training data. We divided the training data into two folds and trained the random forest on one half. Then we computed the proportion of explained loss on the other half and vice versa. Finally we averaged the results.
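For categorical features, the proportion of loss explained compares the misclassification error of the fitted classifier with that of a baseline that always predicts the most common training class. A small sketch, with a hypothetical function name and made-up labels:

```python
from collections import Counter

def proportion_explained_categorical(y_true, y_pred, y_train):
    """1 - MMCE(model) / MMCE(baseline that predicts the most common training class)."""
    mode = Counter(y_train).most_common(1)[0][0]
    mmce_model = sum(p != t for p, t in zip(y_pred, y_true)) / len(y_true)
    mmce_mode = sum(mode != t for t in y_true) / len(y_true)
    # Undefined (division by zero) if the baseline makes no errors on y_true.
    return 1.0 - mmce_model / mmce_mode

# Made-up labels: the model makes 1 error out of 4, the baseline 2 out of 4.
y_train = ["a", "a", "b"]
y_true = ["a", "b", "b", "a"]
y_pred = ["a", "b", "a", "a"]
share = proportion_explained_categorical(y_true, y_pred, y_train)
```

A value of 1 means the feature is perfectly predictable from the remaining features, a value of 0 means the classifier is no better than the majority-class baseline.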

season yr holiday weekday workingday weathersit temp hum windspeed
45% 5% 38% 14% 100% 42% 66% 42% 10%
Table 4: Percentage of loss explained by predicting a feature from the remaining features with a random forest.

To construct the subgroups, we set the maximum tree depth to 2, i.e. we limited the number of possible subgroups to 4. We compared the unconditional and conditional PFI for the bike rental predictions, see Fig. 7.

Figure 7: Left: Comparison of PFI and cPFI for a selection of features. For cPFI we also show the features that constitute the subgroups. For year (yr) no subgroups were found. Right: PFI of temperature within subgroups. The temperature feature is important in spring, fall and winter, but negligible on humid summer days.

The most important feature, according to PFI, was the temperature. Temperature is less important when we condition on season and humidity. To get a deeper understanding of the temperature effect, we examined the effect plots (see Fig. 8). Both ALE and PDP show a monotonic increase of predicted bike rentals up to 25 °C and a decrease beyond that. The PDP shows a weaker negative effect of very high temperatures, which might be caused by extrapolation: high-temperature days are combined with e.g. winter. A limitation of the ALE plot is that we can only interpret it locally. In contrast, our approach is explicit about the subgroup conditions under which the interpretation of the PDP is valid and shows the distributions in which the feature effect may be interpreted. The PDPs in subgroups reveal a more nuanced picture: For dry summer days, increasing temperature mostly has a negative effect on the predicted number of bike rentals.

Figure 8: Effect of temperature on predicted bike rentals. Left: PDP and ALE plot. Right: PDPs for 4 subgroups.

The change in intercepts of the subgroup PDP can be interpreted as the effect of the grouping features (season and humidity). The slope can be interpreted as the temperature effect within a subgroup.
We also demonstrate the subgroup PDPs for the season, a categorical feature. Fig. 9 shows both the PDP and our subgroup PDPs. The normal PDP shows that on average there is no difference between spring, summer and fall and only slightly fewer bike rentals in winter. The PDP with four subgroups conditional on temperature shows that the unconditional PDP is misleading.

Figure 9: Effect of season on predicted rentals. Left: PDP. Right: PDPs in subgroups. Conclusion: The PDP indicates that in winter slightly fewer bikes are rented, while the other seasons are similar. The subgroup PDPs show that, conditional on temperature, the differences between the seasons are much greater, especially for low temperatures. At high temperatures, the number of rented bikes is similar between seasons.

8 Discussion

We proposed the conditional PDP and the conditional PFI, both based on subgroups. This research addresses the inherent conflict between extrapolation and an unconditional interpretation. Our subgroup-based approach unites the best of both worlds: It reduces extrapolation and makes the conditioning explicit through interpretable subgroups, while allowing unconditional interpretation within the subgroups. We have shown that permuting data within subgroups greatly improves data fidelity compared to unconditional permutation. For this purpose we introduced a data fidelity measure based on the maximum mean discrepancy. As a surprising finding, the model fidelity of the subgroup PDPs is better than that of ALE or PDPs. In a simulation we showed that our conditional PFI exceeds or is equal to the state-of-the-art for conditional PFI. The measure of data fidelity can be used for other interpretation methods as well. For local explanation methods such as LIME [21] or Shapley Values [17] it could be adapted to measure local data fidelity.

Computational Details. All experiments were conducted using mlr [16] and R [20]. The code for all experiments is available at

Acknowledgements. This work is funded by the Bavarian State Ministry of Science and the Arts in the framework of the Centre Digitisation.Bavaria (ZD.B) and supported by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A. The authors of this work take full responsibility for its content.


Appendix 0.A Decompose cPFI into subgroup PFIs

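Assuming a perfect construction of $G_j$, it holds that $X_j \perp X_{-j} \mid G_j$ and also that $X_j \perp G_j \mid X_{-j}$ (as $G_j$ is a compression of $X_{-j}$). Therefore

$$P(X_j \mid X_{-j}) = P(X_j \mid X_{-j}, G_j) = P(X_j \mid G_j).$$

When we sample the replacement $\tilde{x}_j^{(i)}$ for an $x_j^{(i)}$ from the marginal within a group ($P(X_j \mid G_j = g_j^{(i)})$, e.g. via permutation) we also sample from the conditional $P(X_j \mid X_{-j} = x_{-j}^{(i)})$. Every data point from the global sample can therefore equivalently be seen as a sample from the marginal within the group, or as a sample from the global conditional.
It follows that the weighted sum of marginal subgroup PFIs coincides with the cPFI:

$$cPFI_j = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{1}{M} \sum_{m=1}^{M} \tilde{L}^{m(i)} - L^{(i)} \right) = \frac{1}{n} \sum_{k=1}^{K_j} \sum_{i: x_{-j}^{(i)} \in \mathcal{G}_j^k} \left( \frac{1}{M} \sum_{m=1}^{M} \tilde{L}^{m(i)} - L^{(i)} \right) = \sum_{k=1}^{K_j} \frac{n_k}{n} PFI_j^k$$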


Appendix 0.B Decompose cPDPs into subgroup PDPs

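As we know that $X_j \perp X_{-j} \mid G_j$, we see that the conditional PDP as defined below can be seen as a pointwise local sum of marginal PDPs:

$$cPDP_j(x) = E[\hat{f}(x, X_{-j}) \mid X_j = x] = \sum_{k=1}^{K_j} P(G_j = k \mid X_j = x) \, E[\hat{f}(x, X_{-j}) \mid X_j = x, G_j = k] = \sum_{k=1}^{K_j} P(G_j = k \mid X_j = x) \, E[\hat{f}(x, X_{-j}) \mid G_j = k] = \sum_{k=1}^{K_j} P(G_j = k \mid X_j = x) \, PDP_j^k(x)$$

The third equality uses the independence of $X_j$ and $X_{-j}$ within each group. We need $P(G_j = k \mid X_j = x)$ in order to construct the cPDP from the group-wise PDPs. These probabilities can be approximated, but cannot be trivially derived analytically, as they depend on the unknown distributions of $X_j$ within the groups.
If we constructed the groups by partitioning $X_j$ (and modelling the conditional of $X_{-j}$), the term would evaluate to $0$ or $1$ and the aggregation would be straightforward to perform.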

Appendix 0.C Expectation and Variance of the PDP in a Subgroup

We show that under feature independence the PDP and a PDP in an arbitrary subgroup have the same expected value and the subgroup PDP has a higher variance.


Appendix 0.D Groundtruth cPFI

The data has the following distribution:

And with . We assume an ML model For any data point, the squared loss is:

The expectation for this is:

For the permutation feature importance, we permute one of the features. The following formula shows permutation of :


So for a single data point, the feature importance is:

The expected value of this is:


Conditional on observed ,

follows a Gaussian distribution:

Finally the expected conditional permutation feature importance becomes:


Appendix 0.E Model Fidelity plots

Figure 10: Comparing the loss between model f and various feature effect methods. Each instance in the boxplot is MSE for one feature, summed over the test data.