MDI+: A Flexible Random Forest-Based Feature Importance Framework

by Abhineet Agarwal et al.

Mean decrease in impurity (MDI) is a popular feature importance measure for random forests (RFs). We show that the MDI for a feature X_k in each tree of an RF is equivalent to the unnormalized R^2 value in a linear regression of the response on the collection of decision stumps that split on X_k. We use this interpretation to propose a flexible feature importance framework called MDI+. Specifically, MDI+ generalizes MDI by allowing the analyst to replace the linear regression model and R^2 metric with regularized generalized linear models (GLMs) and metrics better suited to the given data structure. Moreover, MDI+ incorporates additional features to mitigate known biases of decision trees against additive or smooth models. We further provide guidance on how practitioners can choose an appropriate GLM and metric based upon the Predictability, Computability, Stability framework for veridical data science. Extensive data-inspired simulations show that MDI+ significantly outperforms popular feature importance measures in identifying signal features. We also apply MDI+ to two real-world case studies on drug response prediction and breast cancer subtype classification. We show that MDI+ extracts well-established predictive genes with significantly greater stability than existing feature importance measures. All code and models are released in a full-fledged Python package on GitHub.
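The MDI-as-R^2 equivalence stated in the abstract can be checked numerically. The sketch below (my own illustration, not the authors' released code) fits a single scikit-learn regression tree, builds the mean-centered decision-stump feature for every internal node that splits on a given feature, and verifies that the explained sum of squares from regressing y on those (mutually orthogonal) stumps matches the sum of weighted impurity decreases, i.e. the tree's unnormalized MDI for that feature:

```python
# Numerical check of the claim: a feature's unnormalized MDI in one tree
# equals the explained sum of squares ("unnormalized R^2") of a linear
# regression of y on the decision stumps that split on that feature.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=300)

reg = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
t = reg.tree_
# node_ind[i, j] is True iff training sample i passes through node j.
node_ind = reg.decision_path(X).toarray().astype(bool)

def stump_ess(k):
    """Explained SS of OLS of y on the centered stump features of every
    internal node splitting on feature k.  The stumps are mutually
    orthogonal, so the explained SS decomposes node by node."""
    ess = 0.0
    for j in range(t.node_count):
        if t.children_left[j] == -1 or t.feature[j] != k:
            continue  # leaf, or node splits on a different feature
        in_node = node_ind[:, j]
        goes_right = X[in_node, k] > t.threshold[j]
        psi = np.zeros(len(y))
        # Stump: right-child indicator on the node's samples, centered
        # within the node (zero outside it).
        psi[in_node] = goes_right - goes_right.mean()
        ess += (psi @ y) ** 2 / (psi @ psi)
    return ess

def mdi_raw(k):
    """Sum of weighted impurity decreases over nodes splitting on k
    (MDI before sklearn's normalization), on the sum-of-squares scale."""
    total = 0.0
    for j in range(t.node_count):
        if t.children_left[j] == -1 or t.feature[j] != k:
            continue
        l, r = t.children_left[j], t.children_right[j]
        total += (t.weighted_n_node_samples[j] * t.impurity[j]
                  - t.weighted_n_node_samples[l] * t.impurity[l]
                  - t.weighted_n_node_samples[r] * t.impurity[r])
    return total

for k in range(X.shape[1]):
    assert np.isclose(stump_ess(k), mdi_raw(k))
```

MDI+ starts from this regression view and swaps the OLS fit and R^2 metric for a regularized GLM and a task-appropriate metric; the sketch above only demonstrates the baseline equivalence that motivates that generalization.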


ggRandomForests: Visually Exploring a Random Forest for Regression

Random Forests [Breiman:2001] (RF) are a fully non-parametric statistica...

Understanding Global Feature Contributions Through Additive Importance Measures

Understanding the inner workings of complex machine learning models is a...

Feature Importance Measurement based on Decision Tree Sampling

Random forest is effective for prediction tasks but the randomness of tr...

Dimension Reduction Forests: Local Variable Importance using Structured Random Forests

Random forests are one of the most popular machine learning methods due ...

Interpreting Deep Forest through Feature Contribution and MDI Feature Importance

Deep forest is a non-differentiable deep model which has achieved impres...

MABSplit: Faster Forest Training Using Multi-Armed Bandits

Random forests are some of the most widely used machine learning models ...

Unbiased Measurement of Feature Importance in Tree-Based Methods

We propose a modification that corrects for split-improvement variable i...
