COMET: A Recipe for Learning and Using Large Ensembles on Massive Data

03/10/2011
by Justin D. Basilico, et al.

COMET is a single-pass MapReduce algorithm for learning on large-scale data. It builds multiple random forest ensembles on distributed blocks of data and merges them into a mega-ensemble. This approach is suited to massive-scale data that is too large to fit on a single machine. For the best accuracy, IVoting (importance-sampled voting) should be used instead of bagging to generate the training subset for each decision tree in the random forest. Experiments with two large datasets (5GB and 50GB compressed) show that COMET compares favorably, in both accuracy and training time, to learning on a subsample of the data with a serial algorithm. Finally, we propose a new Gaussian approach for lazy ensemble evaluation that dynamically decides how many ensemble members to evaluate per data point; this can reduce evaluation cost by 100X or more.
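
To make the lazy-evaluation idea concrete, the sketch below illustrates one way a Gaussian stopping rule can decide, per data point, how many ensemble members to evaluate. It is an illustrative assumption, not the paper's implementation: the member-model interface (`predict_proba` returning a scalar probability), the confidence multiplier `z`, and the `min_members` floor are all hypothetical choices.

```python
# Minimal sketch of lazy ensemble evaluation with a Gaussian stopping rule.
# Assumptions: binary classification by averaged probabilities, and member
# models exposing a (hypothetical) predict_proba(x) -> P(y=1 | x) method.
import math
import random

def lazy_ensemble_predict(members, x, z=2.58, min_members=10):
    """Average member predictions for x, stopping early once a Gaussian
    confidence interval around the running mean lies entirely on one side
    of the 0.5 decision threshold."""
    random.shuffle(members)              # evaluate members in random order
    total, total_sq, n = 0.0, 0.0, 0
    for model in members:
        p = model.predict_proba(x)       # assumed scalar probability output
        n += 1
        total += p
        total_sq += p * p
        if n < min_members:
            continue                     # always evaluate a small minimum
        mean = total / n
        var = max(total_sq / n - mean * mean, 1e-12)
        halfwidth = z * math.sqrt(var / n)   # CI for the full-ensemble mean
        # Stop once further members are unlikely to flip the predicted class.
        if mean - halfwidth > 0.5 or mean + halfwidth < 0.5:
            break
    return total / n
```

The saving comes from the fact that most points are easy: their running mean clears the threshold after a handful of members, so only ambiguous points pay for evaluating a large share of the mega-ensemble.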
