Relating the Partial Dependence Plot and Permutation Feature Importance to the Data Generating Process

09/03/2021
by Christoph Molnar, et al.

Scientists and practitioners increasingly rely on machine learning to model data and draw conclusions. Compared to statistical modeling approaches, machine learning makes fewer explicit assumptions about data structures, such as linearity. However, the parameters of machine learning models usually cannot be easily related to the data generating process. To learn about the modeled relationships, partial dependence (PD) plots and permutation feature importance (PFI) are often used as interpretation methods. Yet PD and PFI lack a theory that relates them to the data generating process. We formalize PD and PFI as statistical estimators of ground truth estimands rooted in the data generating process. We show that PD and PFI estimates deviate from this ground truth due to statistical biases, model variance, and Monte Carlo approximation errors. To account for model variance in PD and PFI estimation, we propose the learner-PD and the learner-PFI, which are based on model refits, and we propose corrected variance and confidence interval estimators.
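The quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the least-squares learner `fit_linear`, the synthetic data, the squared-error loss, and the repeated half/half splits are all assumptions chosen for brevity. In particular, the variance returned by `learner_pd` is the plain sample variance across refits, not the corrected variance estimator the paper proposes.

```python
import numpy as np

def fit_linear(X, y):
    """Hypothetical stand-in learner: least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def partial_dependence(predict, X, feature, grid):
    """Monte Carlo PD estimate: fix the feature at each grid value and
    average the model's predictions over the observed data."""
    values = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value
        values.append(predict(X_mod).mean())
    return np.array(values)

def permutation_importance(predict, X, y, feature, rng, n_repeats=10):
    """PFI estimate: mean increase in squared-error loss after permuting
    one feature column, averaged over several permutations."""
    base_loss = np.mean((y - predict(X)) ** 2)
    increases = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        X_perm[:, feature] = rng.permutation(X_perm[:, feature])
        increases.append(np.mean((y - predict(X_perm)) ** 2) - base_loss)
    return float(np.mean(increases))

def learner_pd(X, y, feature, grid, rng, n_refits=10):
    """Average PD curves over model refits on random half/half splits
    (assumed resampling scheme). Returns the mean curve and the
    uncorrected sample variance across refits."""
    curves = []
    for _ in range(n_refits):
        idx = rng.permutation(len(X))
        train, test = idx[: len(X) // 2], idx[len(X) // 2 :]
        predict = fit_linear(X[train], y[train])
        curves.append(partial_dependence(predict, X[test], feature, grid))
    curves = np.array(curves)
    return curves.mean(axis=0), curves.var(axis=0, ddof=1)

# Synthetic data (assumed): only feature 0 influences the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

grid = np.array([-1.0, 0.0, 1.0])
model = fit_linear(X, y)
pd_curve = partial_dependence(model, X, 0, grid)
pfi_relevant = permutation_importance(model, X, y, 0, rng)
pfi_irrelevant = permutation_importance(model, X, y, 1, rng)
mean_curve, var_curve = learner_pd(X, y, 0, grid, rng)
```

In this setup the PD curve for feature 0 should be close to a line with slope 2, and the PFI of feature 0 should be far larger than that of the irrelevant feature 1. The spread of the refit curves in `var_curve` gives a rough picture of the model variance that a single fitted model hides.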


research
06/08/2020

Model-agnostic Feature Importance and Effects with Dependent Features – A Conditional Subgroup Approach

Partial dependence plots and permutation feature importance are popular ...
research
09/24/2018

Statistical Estimation of Malware Detection Metrics in the Absence of Ground Truth

The accurate measurement of security metrics is a critical research prob...
research
05/01/2019

Please Stop Permuting Features: An Explanation and Alternatives

This paper advocates against permute-and-predict (PaP) methods for inter...
research
11/14/2021

Scrutinizing XAI using linear ground-truth data with suppressor variables

Machine learning (ML) is increasingly often used to inform high-stakes d...
research
10/22/2016

Multitask Learning of Vegetation Biochemistry from Hyperspectral Data

Statistical models have been successful in accurately estimating the bio...
research
09/11/2020

Towards a More Reliable Interpretation of Machine Learning Outputs for Safety-Critical Systems using Feature Importance Fusion

When machine learning supports decision-making in safety-critical system...
research
06/30/2016

A Permutation-based Model for Crowd Labeling: Optimal Estimation and Robustness

The aggregation and denoising of crowd labeled data is a task that has g...
