Random Forests for Big Data

11/26/2015
by Robin Genuer, et al.

Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involves massive data, but it often also includes online data and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles, in a single and versatile framework, regression problems as well as two-class and multi-class classification problems. Focusing on classification problems, this paper proposes a selective review of available proposals that deal with scaling random forests to Big Data problems. These proposals rely on parallel environments or on online adaptations of random forests. We also describe how related quantities, such as out-of-bag error and variable importance, are addressed in these methods. Then, we formulate various remarks for random forests in the Big Data context. Finally, we experiment with five variants on two massive datasets (15 and 120 million observations), one simulated and one from real-world data. One variant relies on subsampling, while three others are related to parallel implementations of random forests and involve either various adaptations of the bootstrap to Big Data or "divide-and-conquer" approaches. The fifth variant relates to online learning of random forests. These numerical experiments highlight the relative performance of the different variants, as well as some of their limitations.
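To make the "divide-and-conquer" idea mentioned in the abstract concrete, here is a minimal sketch in plain Python. It is not the implementation evaluated in the paper: one-split decision stumps stand in for full trees, the toy dataset is simulated here for illustration, and every function name (fit_stump, fit_chunk_forest, divide_and_conquer_forest, predict) is hypothetical. The sketch splits the data into chunks, grows a small bootstrapped ensemble on each chunk independently (the step that could run in parallel), then merges all trees into one forest that predicts by majority vote.

```python
import random
from collections import Counter

def fit_stump(rows, rng):
    """Fit a one-split tree on one randomly chosen feature.
    Assumption: a stump stands in for a full CART tree to keep the sketch short."""
    j = rng.randrange(len(rows[0][0]))
    best = None
    for x, _ in rows:
        t = x[j]  # candidate threshold taken from the observed values
        left = [y for xi, y in rows if xi[j] <= t]
        right = [y for xi, y in rows if xi[j] > t]
        if not left or not right:
            continue
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0]
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if best is None or errors < best[0]:
            best = (errors, j, t, l_lab, r_lab)
    if best is None:  # no valid split: fall back to a constant prediction
        label = Counter(y for _, y in rows).most_common(1)[0][0]
        return (0, float("inf"), label, label)
    return best[1:]

def predict_stump(stump, x):
    j, t, left_label, right_label = stump
    return left_label if x[j] <= t else right_label

def fit_chunk_forest(rows, n_trees, rng):
    """Grow n_trees stumps, each on a bootstrap sample of a single chunk."""
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # draw with replacement
        forest.append(fit_stump(sample, rng))
    return forest

def divide_and_conquer_forest(rows, n_chunks, trees_per_chunk, seed=0):
    """Shuffle, split into chunks, fit one sub-forest per chunk, merge the trees."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    chunks = [shuffled[i::n_chunks] for i in range(n_chunks)]
    forest = []
    for chunk in chunks:  # each iteration is independent of the others
        forest.extend(fit_chunk_forest(chunk, trees_per_chunk, rng))
    return forest

def predict(forest, x):
    """Majority vote over all merged trees."""
    votes = Counter(predict_stump(s, x) for s in forest)
    return votes.most_common(1)[0][0]

# Simulated two-class toy data (illustrative only): class 1 when x0 + x1 > 1.
data_rng = random.Random(42)
data = []
for _ in range(300):
    x = (data_rng.random(), data_rng.random())
    data.append((x, int(x[0] + x[1] > 1.0)))

forest = divide_and_conquer_forest(data, n_chunks=3, trees_per_chunk=7)
print(len(forest), predict(forest, (0.95, 0.95)), predict(forest, (0.05, 0.05)))
```

Because each chunk is processed without reference to the others, no tree ever sees the full dataset. Quantities such as out-of-bag error would therefore be computed per chunk rather than over all observations, which is one reason the abstract singles them out for separate treatment.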


research
06/25/2019

AMF: Aggregated Mondrian Forests for Online Learning

Random Forests (RF) is one of the algorithms of choice in many supervise...
research
05/12/2018

A Simple and Effective Model-Based Variable Importance Measure

In the era of "big data", it is becoming more of a challenge to not only...
research
06/15/2019

Online Heterogeneous Mixture Learning for Big Data

We propose the online machine learning for big data analysis with hetero...
research
06/04/2019

Fréchet random forests

Random forests are a statistical learning method widely used in many are...
research
08/01/2019

Optimize TSK Fuzzy Systems for Big Data Classification Problems: Bag of Tricks

Takagi-Sugeno-Kang (TSK) fuzzy systems are flexible and interpretable ma...
research
04/05/2018

Robust Fusion Methods for Structured Big Data

We address one of the important problems in Big Data, namely how to comb...
