Exact Distributed Training: Random Forest with Billions of Examples

04/18/2018
by Mathieu Guillame-Bert, et al.

We introduce an exact distributed algorithm to train Random Forest models, as well as other decision forest models, without relying on an approximate best-split search. We explain the proposed algorithm and compare it to related approaches on several complexity measures (time, RAM, disk, and network). We report its running performance on artificial and real-world datasets of up to 18 billion examples. This figure is several orders of magnitude larger than the datasets tackled in the existing literature. Finally, we show empirically that Random Forest benefits from being trained on more data, even when the dataset is already gigantic. Given a dataset of 17.3B examples with 82 features (3 numerical, the others categorical with high arity), our implementation trains a tree in 22h.
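To make concrete what an "exact" best-split search means here, the following is a minimal single-machine sketch in Python. It is not the distributed algorithm of the paper. The function names `gini` and `exact_best_split`, the binary labels, and the choice of Gini impurity as the score are all assumptions made for illustration. The point is only that every boundary between distinct sorted feature values is scored, whereas an approximate search would score a reduced set of candidate thresholds (for example, histogram bucket edges).

```python
from typing import List, Optional, Tuple


def gini(pos: int, total: int) -> float:
    """Gini impurity of a binary-label node with `pos` positives out of `total`."""
    if total == 0:
        return 0.0
    p = pos / total
    return 2.0 * p * (1.0 - p)


def exact_best_split(
    values: List[float], labels: List[int]
) -> Optional[Tuple[float, float]]:
    """Return (threshold, weighted_gini) for the best split `value < threshold`.

    Hypothetical helper: scores every boundary between two distinct
    consecutive sorted values, so no candidate threshold is skipped.
    Returns None when all values are equal (no valid split exists).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    total_pos = sum(labels)
    best: Optional[Tuple[float, float]] = None
    left_pos = 0  # positives among the examples left of the boundary
    for rank in range(1, n):
        prev_i, cur_i = order[rank - 1], order[rank]
        left_pos += labels[prev_i]
        if values[prev_i] == values[cur_i]:
            continue  # cannot place a threshold between equal values
        left_n, right_n = rank, n - rank
        score = (
            left_n * gini(left_pos, left_n)
            + right_n * gini(total_pos - left_pos, right_n)
        ) / n
        threshold = (values[prev_i] + values[cur_i]) / 2.0
        if best is None or score < best[1]:
            best = (threshold, score)
    return best
```

For instance, `exact_best_split([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])` evaluates the three boundaries 1.5, 2.5, and 3.5 and selects 2.5, which separates the two classes perfectly. The sort dominates the cost on one machine. Performing the same exhaustive scan when the examples do not fit on a single worker is the problem the abstract describes.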

research
10/02/2019

A note on the consistency of the random forest algorithm

Examples are given of data-generating models for which Breiman's random ...
research
09/11/2020

DART: Data Addition and Removal Trees

How can we update data for a machine learning model after it has already...
research
01/25/2022

Model Generalization in Arrival Runway Occupancy Time Prediction by Feature Equivalences

General real-time runway occupancy time prediction modelling for multipl...
research
03/07/2020

Getting Better from Worse: Augmented Bagging and a Cautionary Tale of Variable Importance

As the size, complexity, and availability of data continues to grow, sci...
research
07/23/2022

Unstructured Road Segmentation using Hypercolumn based Random Forests of Local experts

Monocular vision based road detection methods are mostly based on machin...
research
03/29/2016

Nine Features in a Random Forest to Learn Taxonomical Semantic Relations

ROOT9 is a supervised system for the classification of hypernyms, co-hyp...
research
02/18/2023

Reproducing Random Forest Efficacy in Detecting Port Scanning

Port scanning is the process of attempting to connect to various network...
