A Sharp Bound on the Computation-Accuracy Tradeoff for Majority Voting Ensembles

03/04/2013
by Miles E. Lopes, et al.

When random forests are used for binary classification, an ensemble of t = 1, 2, ... randomized classifiers is generated, and the predictions of the classifiers are aggregated by majority vote. Due to the randomness in the algorithm, there is a natural tradeoff between statistical performance and computational cost. On one hand, as t increases, the (random) prediction error of the ensemble tends to decrease and stabilize. On the other hand, larger ensembles require greater computational cost for training and for making new predictions. The present work offers a new approach for quantifying this tradeoff: Given a fixed training set D, let the random variables Err_{t,0} and Err_{t,1} denote the class-wise prediction error rates of a randomly generated ensemble of size t. As t → ∞, we provide a general bound on the "algorithmic variance", var(Err_{t,l} | D) ≤ f_l(1/2)^2/(4t) + o(1/t), where l ∈ {0,1} and f_l is a density function that arises from the ensemble method. Conceptually, this result is somewhat surprising, because var(Err_{t,l} | D) describes how Err_{t,l} varies over repeated runs of the algorithm, and yet the formula leads to a method for bounding var(Err_{t,l} | D) with a single ensemble. The bound is also sharp, in the sense that it is attained by an explicit family of randomized classifiers. With regard to the task of estimating f_l(1/2), the presence of the ensemble leads to a unique twist on the classical setup of non-parametric density estimation, wherein the effects of sample size and computational cost are intertwined. In particular, we propose an estimator for f_l(1/2) and derive an upper bound on its mean-squared error that matches standard optimal non-parametric rates when t is sufficiently large.
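To make the bound concrete, here is a minimal Python simulation sketch, not code from the paper. It assumes a stylized family of threshold-type randomized classifiers: each test point x is summarized by p(x), the probability that a single randomized classifier errs on it, and classifier j errs on every point with p(x) above a uniformly drawn threshold theta_j. Under this construction the majority vote errs exactly on the points whose p(x) exceeds the sample median of the thresholds. The Beta(2, 2) model for p(X), the ensemble size, and all variable names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

t = 501        # ensemble size (odd, so majority votes have no ties)
n_runs = 2000  # independent ensembles used to measure algorithmic variance

# p(X) ~ Beta(2, 2): an arbitrary modeling choice for the per-point
# error probabilities of a single randomized classifier.
p_dist = stats.beta(2.0, 2.0)

# Stylized randomized ensemble: classifier j draws theta_j ~ Uniform(0, 1)
# and errs on every point with p(x) >= theta_j.  The majority vote then
# errs exactly on points with p(x) above the sample median of the
# thresholds, so the class-wise error rate is
#   Err_t = 1 - G(median(theta)),   where G is the CDF of p(X).
def ensemble_error_rate(t, rng):
    theta = rng.uniform(0.0, 1.0, size=t)
    return 1.0 - p_dist.cdf(np.median(theta))

errs = np.array([ensemble_error_rate(t, rng) for _ in range(n_runs)])

# Algorithmic variance over repeated runs vs. the bound f(1/2)^2 / (4t),
# where f is the density of p(X).
f_half = p_dist.pdf(0.5)
print(f"empirical var(Err_t): {errs.var():.3e}")
print(f"f(1/2)^2 / (4t):      {f_half ** 2 / (4 * t):.3e}")

# f(1/2) can also be estimated from a single ensemble: the observed vote
# fractions V_t(x_i) concentrate around p(x_i), so a kernel density
# estimate of the vote fractions, evaluated at 1/2, approximates f(1/2).
theta = rng.uniform(0.0, 1.0, size=t)           # one ensemble
p_sample = rng.beta(2.0, 2.0, size=5000)        # p(x_i) on a test sample
votes = (theta[None, :] <= p_sample[:, None]).mean(axis=1)
f_half_hat = stats.gaussian_kde(votes)(0.5)[0]
print(f"estimated f(1/2): {f_half_hat:.3f}  (true value {f_half:.3f})")
```

Because this particular family attains the bound, the empirical variance over repeated runs should come out close to f(1/2)^2/(4t), which illustrates the sharpness claim; the final block sketches the single-ensemble idea, recovering f(1/2) from one set of vote fractions without rerunning the algorithm.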


