Efficient nonparametric statistical inference on population feature importance using Shapley values

06/16/2020
by Brian D. Williamson, et al.

The true population-level importance of a variable in a prediction task provides useful knowledge about the underlying data-generating mechanism and can help in deciding which measurements to collect in subsequent experiments. Valid statistical inference on this importance is a key component in understanding the population of interest. We present a computationally efficient procedure for estimating and obtaining valid statistical inference on the Shapley Population Variable Importance Measure (SPVIM). Although the computational complexity of the true SPVIM scales exponentially with the number of variables, we propose an estimator based on randomly sampling only Θ(n) feature subsets given n observations. We prove that our estimator converges at an asymptotically optimal rate. Moreover, by deriving the asymptotic distribution of our estimator, we construct valid confidence intervals and hypothesis tests. Our procedure has good finite-sample performance in simulations and, in an in-hospital mortality prediction task, produces similar variable importance estimates across different machine learning algorithms.
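To illustrate why sampling feature subsets avoids the exponential cost, the sketch below compares exact Shapley values, computed by enumerating every subset, with a Monte Carlo estimate. It is a hedged illustration only: it uses plain permutation sampling, not the estimator proposed in the paper, and the function names and the value function `toy_value` (standing in for the predictiveness of a model fit on a feature subset) are hypothetical.

```python
import itertools
import math
import random


def toy_value(s):
    """Hypothetical predictiveness (e.g. an R^2-like score) of feature subset s.

    Assumption for illustration: features 0 and 1 contribute additively and
    share an interaction bonus; feature 2 is pure noise.
    """
    base = {0: 0.3, 1: 0.2, 2: 0.0}
    v = sum(base[j] for j in s)
    if 0 in s and 1 in s:
        v += 0.1
    return v


def exact_shapley(value, p):
    """Exact Shapley values by enumerating all subsets (cost grows as 2^p)."""
    phi = [0.0] * p
    for j in range(p):
        others = [k for k in range(p) if k != j]
        for size in range(p):
            # Shapley weight for a subset of this size that excludes feature j.
            weight = (math.factorial(size) * math.factorial(p - size - 1)
                      / math.factorial(p))
            for subset in itertools.combinations(others, size):
                s = frozenset(subset)
                phi[j] += weight * (value(s | {j}) - value(s))
    return phi


def sampled_shapley(value, p, n_perms, seed=0):
    """Monte Carlo Shapley estimate from n_perms random feature orderings."""
    rng = random.Random(seed)
    phi = [0.0] * p
    for _ in range(n_perms):
        order = list(range(p))
        rng.shuffle(order)
        s = frozenset()
        prev = value(s)
        for j in order:
            # Marginal gain from adding feature j to the features before it.
            s = s | {j}
            cur = value(s)
            phi[j] += cur - prev
            prev = cur
    return [x / n_perms for x in phi]


exact = exact_shapley(toy_value, 3)
approx = sampled_shapley(toy_value, 3, n_perms=2000)
```

In this toy setting the interaction bonus is split evenly between features 0 and 1, and the sampled estimates sum to the value of the full feature set because each ordering's marginal gains telescope. In practice each call to the value function would require fitting or evaluating a prediction model, which is why the number of sampled subsets matters.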


