Random Forests, Decision Trees, and Categorical Predictors: The "Absent Levels" Problem

06/12/2017
by Timothy C. Au et al.

One of the advantages that decision trees have over many other models is their ability to natively handle categorical predictors without having to first transform them (e.g., by using one-hot encoding). However, in this paper, we show how this capability can also lead to an inherent "absent levels" problem for decision tree-based algorithms that, to the best of our knowledge, has never been thoroughly discussed, and whose consequences have never been carefully explored. This predicament occurs whenever there is indeterminacy in how to handle an observation that has reached a categorical split which was determined when the observation's level was absent during training. Although these incidents may appear to be innocuous, by using Leo Breiman and Adele Cutler's random forests FORTRAN code and the randomForest R package as motivating case studies, we show how overlooking the absent levels problem can systematically bias a model. Afterwards, we discuss some heuristics that can possibly be used to help mitigate the absent levels problem and, using three real data examples taken from public repositories, we demonstrate the superior performance and reliability of these heuristics over some of the existing approaches that are currently being employed in practice due to oversights in the software implementations of decision tree-based algorithms. Given how extensively these algorithms have been used, it is conceivable that a sizable number of these models have been unknowingly and seriously affected by this issue, further emphasizing the need for the development of both theory and software that accounts for the absent levels problem.
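To make the indeterminacy concrete, the following is a minimal, hypothetical sketch of a single categorical split and a few plausible routing heuristics. The function name, the split, and the heuristic labels are illustrative assumptions for exposition; they are not the paper's exact algorithms or the behavior of any particular implementation.

```python
def route(level, left_levels, right_levels, heuristic="left"):
    """Route a categorical level through a learned split.

    Levels seen during training are routed unambiguously; a level that was
    absent during training is ambiguous, and the outcome depends entirely on
    the heuristic chosen (illustrative options below).
    """
    if level in left_levels:
        return "left"
    if level in right_levels:
        return "right"
    # Absent level: the learned split provides no guidance.
    if heuristic == "left":
        # A fixed default direction (can systematically bias predictions).
        return "left"
    if heuristic == "majority":
        # Send to the child grouping more training levels (a crude proxy
        # for the child with more training data).
        return "left" if len(left_levels) >= len(right_levels) else "right"
    if heuristic == "random":
        # Randomize the direction to avoid a systematic bias.
        import random
        return random.choice(["left", "right"])
    raise ValueError(f"unknown heuristic: {heuristic}")

# Split learned during training: {"A"} goes left, {"B", "C"} go right.
print(route("A", {"A"}, {"B", "C"}))              # seen level: unambiguous
print(route("D", {"A"}, {"B", "C"}, "majority"))  # absent level: heuristic decides
```

The point of the sketch is that the answer for level "D" is not determined by the training data at all: different software defaults (a fixed direction, a majority rule, randomization) yield different predictions for identical inputs, which is exactly the kind of implementation-dependent behavior the abstract warns about.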


