Understanding Random Forests: From Theory to Practice

07/28/2014
by Gilles Louppe, et al.

Data analysis and machine learning have become an integral part of the modern scientific methodology, offering automated procedures for the prediction of a phenomenon based on past observations, unraveling underlying patterns in data and providing insights about the problem. Yet, care should be taken to avoid using machine learning as a black-box tool; it should rather be considered as a methodology, with a rational thought process that is entirely dependent on the problem under study. In particular, the use of algorithms should ideally require a reasonable understanding of their mechanisms, properties and limitations, in order to better apprehend and interpret their results. Accordingly, the goal of this thesis is to provide an in-depth analysis of random forests, consistently calling into question each and every part of the algorithm, in order to shed new light on its learning capabilities, inner workings and interpretability. The first part of this work studies the induction of decision trees and the construction of ensembles of randomized trees, motivating their design and purpose whenever possible. Our contributions follow with an original complexity analysis of random forests, showing their good computational performance and scalability, along with an in-depth discussion of their implementation details, as contributed within Scikit-Learn. In the second part of this work, we analyse and discuss the interpretability of random forests through the lens of variable importance measures. The core of our contributions lies in the theoretical characterization of the Mean Decrease of Impurity variable importance measure, from which we prove and derive some of its properties in the case of multiway totally randomized trees and in asymptotic conditions. As a consequence of this work, our analysis demonstrates that variable importances [...].


