Conditional Feature Importance for Mixed Data

10/06/2022
by Kristin Blesch, et al.

Despite the popularity of feature importance measures in interpretable machine learning, the statistical adequacy of these methods is rarely discussed. From a statistical perspective, a major distinction is between analyzing a variable's importance before and after adjusting for covariates - i.e., between marginal and conditional measures. Our work draws attention to this rarely acknowledged yet crucial distinction and showcases its implications. Further, we reveal that for testing conditional feature importance (CFI), only a few methods are available, and practitioners have hitherto been severely restricted in method application due to mismatching data requirements. Most real-world data exhibits complex feature dependencies and incorporates both continuous and categorical data (mixed data). Both properties are often neglected by CFI measures. To fill this gap, we propose to combine the conditional predictive impact (CPI) framework (arXiv:1901.09917) with sequential knockoff sampling (arXiv:2010.14026). The CPI enables CFI measurement that controls for feature dependencies by sampling valid knockoffs, i.e., synthetic data whose statistical properties resemble those of the data to be analyzed. Sequential knockoffs were deliberately designed to handle mixed data and thus allow us to extend the CPI approach to such datasets. We demonstrate through numerous simulations and a real-world example that our proposed workflow controls the type I error, achieves high power, and is in line with results given by other CFI measures, whereas marginal feature importance metrics result in misleading interpretations. Our findings highlight the necessity of developing statistically adequate, specialized methods for mixed data.
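The general CPI idea can be sketched as follows: a feature's conditional predictive impact is the average increase in per-sample loss when that feature is replaced by a knockoff copy, and the paired loss differences are tested for significance. The minimal Python sketch below illustrates this under simplifying assumptions: it uses a crude Gaussian-conditional sampler (the hypothetical helper gaussian_knockoff) instead of the sequential knockoff sampler for mixed data proposed in the paper, and the model, loss, and simulated data are illustrative choices, not the authors' implementation.

```python
# Minimal CPI sketch: compare per-sample loss on the original data with the
# loss when one feature is replaced by a knockoff copy, then test the paired
# difference. A simple Gaussian-conditional knockoff stands in for the
# sequential knockoffs used in the paper (illustrative assumption only).
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Simulated data: x1 drives y; x2 is correlated with x1 but conditionally inert.
n = 1000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.5, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

def gaussian_knockoff(X, j, rng):
    """Hypothetical knockoff sampler: redraw column j from a Gaussian
    conditional on the remaining columns (illustration only)."""
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    fitted = others @ beta
    resid_sd = np.std(X[:, j] - fitted)
    X_ko = X.copy()
    X_ko[:, j] = fitted + rng.normal(scale=resid_sd, size=len(X))
    return X_ko

def cpi(model, X, y, j, rng):
    """Conditional predictive impact of feature j: mean loss increase under
    knockoff substitution, with a one-sided paired t-test."""
    loss_orig = (y - model.predict(X)) ** 2
    loss_ko = (y - model.predict(gaussian_knockoff(X, j, rng))) ** 2
    delta = loss_ko - loss_orig
    t, p = stats.ttest_1samp(delta, 0.0, alternative="greater")
    return delta.mean(), p

for j in range(X_te.shape[1]):
    est, p = cpi(model, X_te, y_te, j, rng)
    print(f"feature {j}: CPI = {est:.4f}, p = {p:.3f}")
```

In this toy setup, x1 should receive a large, significant CPI, while the correlated but conditionally irrelevant x2 should receive a CPI near zero with a non-significant p-value; a marginal importance measure, by contrast, would tend to flag x2 as well, which is the marginal-versus-conditional distinction the abstract emphasizes.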
