A Stratification Approach to Partial Dependence for Codependent Variables

07/15/2019
by Terence Parr, et al.

Model interpretability is important to machine learning practitioners, and a key component of interpretation is characterizing the partial dependence of the response variable on any subset of the features used in the model. The two most common strategies for assessing partial dependence have critical weaknesses. In the first strategy, linear regression coefficients describe how a unit change in an explanatory variable changes the response while other variables are held constant. But linear regression is inapplicable to high-dimensional (p>n) data sets and is often insufficient to capture the relationship between the explanatory variables and the response. In the second strategy, Partial Dependence (PD) plots and Individual Conditional Expectation (ICE) plots give biased results in the common situation of codependent variables, and they rely on fitted models provided by the user. When the supplied model is a poor choice due to systematic bias or overfitting, PD/ICE plots provide little (if any) useful information. To address these issues, we introduce a new strategy, called StratPD, that does not depend on a user's fitted model, provides accurate results in the presence of codependent variables, and is applicable to high-dimensional settings. The strategy uses a decision tree to stratify a data set into groups of observations that are similar except in the variable of interest. Any fluctuations of the response variable within a group are then likely due to the variable of interest. We apply StratPD to a collection of simulations and case studies to show that it is a fast, reliable, and robust method for assessing partial dependence, with clear advantages over state-of-the-art methods.
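The stratification idea can be illustrated with a minimal sketch. This is not the authors' StratPD implementation: the tiny regression tree is hand-rolled, the function names (`stratify`, `within_slope`, `make_data`) and the `min_leaf` parameter are hypothetical, and a single pooled within-group slope stands in for a full partial dependence curve. The synthetic data assume two codependent features, with y = 2*x0 + 3*x1 plus noise and x1 tracking x0.

```python
import random
from statistics import fmean

def sse(ys):
    """Sum of squared deviations of ys from their mean."""
    m = fmean(ys)
    return sum((y - m) ** 2 for y in ys)

def stratify(rows, cols, min_leaf=10):
    """Split rows [(x_tuple, y), ...] into groups using a small regression
    tree that is only allowed to split on the columns in cols (every
    feature except the variable of interest)."""
    best = None
    if len(rows) >= 2 * min_leaf:
        for c in cols:
            srt = sorted(rows, key=lambda r: r[0][c])
            for i in range(min_leaf, len(srt) - min_leaf + 1):
                if srt[i - 1][0][c] == srt[i][0][c]:
                    continue  # cannot split between equal values
                cost = sse([r[1] for r in srt[:i]]) + sse([r[1] for r in srt[i:]])
                if best is None or cost < best[0]:
                    best = (cost, srt[:i], srt[i:])
    if best is None:
        return [rows]  # leaf: observations similar in all other features
    return stratify(best[1], cols, min_leaf) + stratify(best[2], cols, min_leaf)

def within_slope(groups, j):
    """Least-squares slope of y on feature j, pooled over the groups after
    centering x_j and y inside each group."""
    num = den = 0.0
    for g in groups:
        xs = [r[0][j] for r in g]
        ys = [r[1] for r in g]
        mx, my = fmean(xs), fmean(ys)
        num += sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        den += sum((a - mx) ** 2 for a in xs)
    return num / den

def stratified_slope(X, y, j, min_leaf=10):
    """Slope of y on feature j estimated within strata of the other features."""
    others = [c for c in range(len(X[0])) if c != j]
    return within_slope(stratify(list(zip(X, y)), others, min_leaf), j)

def naive_slope(X, y, j):
    """Marginal slope of y on feature j, ignoring all other features."""
    return within_slope([list(zip(X, y))], j)

def make_data(n, seed=0):
    """Assumed synthetic data: x1 is codependent with x0, y = 2*x0 + 3*x1 + noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x0 = rng.uniform(0, 10)
        x1 = x0 + rng.gauss(0, 1)
        X.append((x0, x1))
        y.append(2 * x0 + 3 * x1 + rng.gauss(0, 0.1))
    return X, y

if __name__ == "__main__":
    X, y = make_data(300)
    print("naive slope for x0:     ", round(naive_slope(X, y, 0), 2))
    print("stratified slope for x0:", round(stratified_slope(X, y, 0), 2))
```

On this data the marginal slope absorbs the effect of x1, so it lands near the sum of both coefficients, while the within-group slope should stay close to the coefficient of x0 alone. A small residual bias remains because x1 still varies slightly inside each group.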

research
05/16/2019

A DFA-based bivariate regression model for estimating the dependence of PM2.5 among neighbouring cities

On the basis of detrended fluctuation analysis (DFA), we propose a new b...
research
08/09/2021

Visualizing Variable Importance and Variable Interaction Effects in Machine Learning Models

Variable importance, interaction measures, and partial dependence plots ...
research
02/26/2018

Partial Distance Correlation Screening for High Dimensional Time Series

High dimensional time series datasets are becoming increasingly common i...
research
10/30/2022

Prediction Sets for High-Dimensional Mixture of Experts Models

Large datasets make it possible to build predictive models that can capt...
research
07/02/2020

Floodgate: inference for model-free variable importance

Many modern applications seek to understand the relationship between an ...
research
06/11/2020

Modeling high-dimensional dependence among astronomical data

Fixing the relationship among a set of experimental quantities is a fund...
research
07/07/2023

Scalable High-Dimensional Multivariate Linear Regression for Feature-Distributed Data

Feature-distributed data, referred to data partitioned by features and s...
