Accelerated and interpretable oblique random survival forests

08/01/2022
by Byron C. Jaeger, et al.

The oblique random survival forest (RSF) is an ensemble supervised learning method for right-censored outcomes. Trees in the oblique RSF are grown using linear combinations of predictors to create branches, whereas the standard RSF uses a single predictor at each split. Oblique RSF ensembles often achieve higher prediction accuracy than standard RSF ensembles, but assessing all possible linear combinations of predictors induces substantial computational overhead that limits application to large-scale data sets. In addition, few methods have been developed for interpreting oblique RSF ensembles, which remain more difficult to interpret than their axis-based counterparts. We introduce a method to increase the computational efficiency of the oblique RSF and a method to estimate the importance of individual predictor variables with the oblique RSF. Our strategy for reducing computational overhead uses Newton-Raphson scoring, a classical optimization technique that we apply to the Cox partial likelihood function within each non-leaf node of a decision tree. We estimate the importance of an individual predictor by negating each coefficient used for that predictor in the forest's linear combinations and then computing the resulting reduction in out-of-bag accuracy. In general benchmarking experiments, our implementation of the oblique RSF is approximately 450 times faster, with equivalent discrimination and a superior Brier score, compared with existing software for oblique RSFs. In simulation studies, 'negation importance' discriminates between relevant and irrelevant predictors more reliably than permutation importance, Shapley additive explanations, and a previously introduced technique for measuring variable importance with oblique RSFs based on analysis of variance. The methods introduced in the current study are available in the aorsf R package.
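The abstract's first contribution, Newton-Raphson scoring for the Cox partial likelihood inside each non-leaf node, can be illustrated in a few lines of R, the language of the aorsf package. The sketch below is not the package's implementation: the function name, the Breslow-style handling of tied event times, and the single-step update are illustrative assumptions.

```r
# One Newton-Raphson step on the Cox partial likelihood (a minimal sketch).
# Assumes complete data; tied event times share risk sets (a Breslow-style
# simplification that ignores within-tie ordering).
cox_newton_step <- function(X, time, status, beta) {
  ord    <- order(-time)              # sort descending: risk sets are prefixes
  X      <- X[ord, , drop = FALSE]
  status <- status[ord]
  p    <- ncol(X)
  w    <- exp(drop(X %*% beta))       # exp(linear predictor)
  S0   <- cumsum(w)                   # running risk-set sums of w ...
  S1   <- apply(w * X, 2, cumsum)     # ... and of w * x
  U    <- numeric(p)                  # score (gradient) vector
  info <- matrix(0, p, p)             # observed information (negative Hessian)
  for (i in which(status == 1)) {     # contributions come from events only
    mu   <- S1[i, ] / S0[i]           # weighted mean of x over the risk set
    U    <- U + X[i, ] - mu
    # second-moment risk-set sum, recomputed naively here for clarity;
    # a production version would accumulate it in one pass
    S2   <- crossprod(X[1:i, , drop = FALSE], w[1:i] * X[1:i, , drop = FALSE])
    info <- info + S2 / S0[i] - tcrossprod(mu)
  }
  beta + solve(info, U)               # Newton-Raphson update
}
```

Taking a small, fixed number of such steps per node, rather than iterating the partial likelihood to full convergence, is a plausible source of the large speedup the abstract reports, since each step needs only risk-set sums over the node's data.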
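The second contribution, negation importance, requires no new machinery at prediction time: flip the sign of every coefficient a predictor receives in the forest's linear combinations and record how much out-of-bag accuracy drops. A minimal usage sketch with the aorsf package named in the abstract; the function and data set names follow the package's documentation, but treat exact arguments as assumptions subject to API changes.

```r
library(aorsf)

# Fit an oblique RSF to the package's bundled Mayo Clinic PBC data
fit <- orsf(data = pbc_orsf, formula = Surv(time, status) ~ . - id)

# Negation importance: negate each predictor's coefficients across the
# forest and measure the resulting drop in out-of-bag accuracy
orsf_vi_negate(fit)
```

Unlike permutation importance, negation leaves the predictor's observed values intact and instead reverses the direction of its contribution to each linear combination, which is why it is defined only for oblique forests.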
