Estimating the Algorithmic Variance of Randomized Ensembles via the Bootstrap

07/20/2019 ∙ by Miles E. Lopes, et al. ∙ University of California, Davis

Although the methods of bagging and random forests are some of the most widely used prediction methods, relatively little is known about their algorithmic convergence. In particular, there are not many theoretical guarantees for deciding when an ensemble is "large enough" --- so that its accuracy is close to that of an ideal infinite ensemble. Because bagging and random forests are randomized algorithms, the choice of ensemble size is closely related to the notion of "algorithmic variance" (i.e. the variance of prediction error due only to the training algorithm). In the present work, we propose a bootstrap method to estimate this variance for bagging, random forests, and related methods in the context of classification. To be specific, suppose the training dataset is fixed, and let the random variable Err_t denote the prediction error of a randomized ensemble of size t. Working under a "first-order model" for randomized ensembles, we prove that the centered law of Err_t can be consistently approximated via the proposed method as t→∞. Meanwhile, the computational cost of the method is quite modest, by virtue of an extrapolation technique. As a consequence, the method offers a practical guideline for deciding when the algorithmic fluctuations of Err_t are negligible.


1 Introduction

Random forests and bagging are some of the most widely used prediction methods (Breiman, 1996, 2001), and over the course of the past fifteen years, much progress has been made in analyzing their statistical performance (Bühlmann and Yu, 2002; Hall and Samworth, 2005; Biau, Devroye and Lugosi, 2008; Biau, 2012; Scornet, Biau and Vert, 2015). However, from a computational perspective, relatively little is understood about the algorithmic convergence of these methods, and in practice, ad hoc criteria are generally used to assess this convergence.

To clarify the idea of algorithmic convergence, recall that when bagging and random forests are used for classification, a large collection of t randomized classifiers is trained, and then new predictions are made by taking the plurality vote of the classifiers. If such a method is run several times on the same training data D, the prediction error of the ensemble will vary with each run, due to the randomized training algorithm. As the ensemble size t increases with D held fixed, the random variable Err_t typically decreases and eventually stabilizes at a limiting value Err_∞. In this way, an ensemble reaches algorithmic convergence when its prediction error nearly matches that of an infinite ensemble trained on the same data.

Meanwhile, with regard to computational cost, larger ensembles are more expensive to train, to store in memory, and to evaluate on unlabeled points. For this reason, it is desirable to have a quantitative guarantee that an ensemble of a given size will perform nearly as well as an infinite one. This type of guarantee also prevents wasted computation, and assures the user that extra classifiers are unlikely to yield much improvement in accuracy.

1.1 Contributions and related work

To measure algorithmic convergence, we propose a new bootstrap method for approximating the distribution of Err_t − Err_∞ as t → ∞. Such an approximation allows the user to decide when the algorithmic fluctuations of Err_t around Err_∞ are negligible. In particular, if we refer to the algorithmic variance var(Err_t | D) as the variance of Err_t due only to the training algorithm, then the parameter σ_t := (var(Err_t | D))^{1/2} is a concrete measure of convergence that can be estimated via the bootstrap. In addition, the computational cost of the method turns out to be quite modest, by virtue of an extrapolation technique, as described in Section 4.

Although the bootstrap is an established approach to distributional approximation and variance estimation, our work applies the bootstrap in a relatively novel way. Namely, the method is based on “bootstrapping an algorithm”, rather than “bootstrapping data” — and in essence, we are applying an inferential method in order to serve a computational purpose. The opportunities for applying this perspective to other randomized algorithms can also be seen in the papers Byrd et al. (2012); Lopes, Wang and Mahoney (2017, 2018), which deal with stochastic gradient methods, as well as randomized versions of matrix multiplication and least-squares.

Bootstrap consistency

From a theoretical standpoint, our main result (Theorem 3.1) shows that the proposed method consistently approximates the distribution of Err_t − Err_∞ as t → ∞ under a "first-order model" for randomized ensembles. The proof also offers a couple of theoretical contributions related to Hadamard differentiability and the functional delta method (van der Vaart and Wellner, 1996, Chapter 3.9). The first ingredient is a lifting operator, which transforms a univariate empirical c.d.f. into a multivariate analogue defined on a simplex. In addition to having interesting properties in its own right, the lifting operator will allow us to represent Err_t as a functional of an empirical process. The second ingredient is the calculation of this functional's Hadamard derivative, which leads to a surprising connection with the classical first variation formula for smooth manifolds (Simon, 1983; White, 2016). (Further examples of problems where geometric analysis plays a role in understanding the performance of numerical algorithms may be found in the book of Bürgisser and Cucker (2013).)

To briefly comment on the role of this formula in our analysis, consider the following informal statement of it. Let M be a smooth d-dimensional manifold contained in a Euclidean space, and let {φ_s} be a one-parameter family of diffeomorphisms φ_s : M → φ_s(M), satisfying φ_s → ι as s → 0, where ι denotes the identity map on M. Then,

d/ds vol(φ_s(M)) |_{s=0} = ∫_M div_M(V) dvol,    (1.1)

where vol is a volume measure, the symbol div_M(V) denotes the divergence of the vector field V := (d/ds) φ_s |_{s=0}, and the symbol dvol is a volume element on M. In our analysis, it is necessary to adapt this result to a situation where the maps φ_s are non-smooth, the manifold M is a non-smooth subset of Euclidean space, and the vector field V is a non-smooth Gaussian process. Furthermore, applying a version of Stokes' theorem to the right side of equation (1.1) leads to a particular linear functional of V, which turns out to be the Hadamard derivative relevant to understanding Err_t. A more detailed explanation of this connection is given below equation (B.1) in Appendix B.

Related work

In the setting of binary classification, a few papers analyze the bias E[Err_t] − Err_∞, and show that it converges at the fast rate of 1/t under various conditions (Ng and Jordan, 2001; Lopes, 2016; Cannings and Samworth, 2017). A couple of other works study alternative measures of convergence. For instance, the paper Lam and Suen (1997) considers the probability that the majority vote commits an error at a fixed test point, and the paper Hernández-Lobato, Martínez-Muñoz and Suárez (2013) provides an informal analysis of the probability that an ensemble of size t disagrees with an infinite ensemble at a random test point, but these approaches do not directly control the fluctuations of Err_t. In addition, some empirical studies of algorithmic convergence may be found in the papers Oshiro, Perez and Baranauskas (2012); Latinne, Debeir and Decaestecker (2001).

Among the references just mentioned, the ones that are most closely related to the current paper are Lopes (2016) and Cannings and Samworth (2017). These works derive theoretical upper bounds on var(Err_t) or var(Err_{t,ℓ}), where Err_{t,ℓ} is the error rate on a particular class ℓ (cf. Section 4). The paper Lopes (2016) also proposes a method to estimate the unknown parameters in such bounds. In relation to these works, the current paper differs in two significant ways. First, we offer an approximation to the full distribution of Err_t − Err_∞, and hence provide a direct estimate of algorithmic variance, rather than a bound. Second, the method proposed here is relevant to a wider range of problems, since it can handle any number of classes, whereas the analyses in Lopes (2016) and Cannings and Samworth (2017) are specialized to the binary setting. Moreover, the theoretical analysis of the bootstrap approach is entirely different from the previous techniques used in deriving variance bounds.

Outside of the setting of randomized ensemble classifiers, the papers Sexton and Laake (2009); Arlot and Genuer (2014); Wager, Hastie and Efron (2014); Mentch and Hooker (2016); Scornet (2016a) look at the algorithmic fluctuations of ensemble regression functions at a fixed test point.

1.2 Background and setup

We consider the general setting of a classification problem with k ≥ 2 classes. The set of training data is denoted D = {(X_1, Y_1), …, (X_n, Y_n)}, which is contained in a sample space 𝒳 × 𝒴. The feature space 𝒳 is arbitrary, and the space of labels 𝒴 has cardinality k. An ensemble of t classifiers is denoted by Q_1, …, Q_t, with t ≥ 1.

Randomized ensembles

The key issue in studying the algorithmic convergence of bagging and random forests is randomization. In the method of bagging, randomization is introduced by generating random sets D_1, …, D_t, each of the same size as D, via sampling with replacement from D. For each i = 1, …, t, a classifier Q_i is trained on D_i, with the same classification method being used each time. When each Q_i is trained with a decision tree method (such as CART (Breiman et al., 1984)), the random forests procedure extends bagging by adding a randomized feature selection rule (Breiman, 2001).

It is helpful to note that the classifiers in bagging and random forests can be represented in a common way. Namely, there is a deterministic function, say g, such that for any fixed x ∈ 𝒳, each classifier can be written as

Q_i(x) = g(x, D, ξ_i),  i = 1, …, t,    (1.2)

where ξ_1, ξ_2, … is an i.i.d. sequence of random objects, independent of D, that specify the "randomizing parameters" of the classifiers (cf. Breiman (2001, Definition 1.1)). For instance, in the case of bagging, the object ξ_i specifies the randomly chosen points in D_i.

Beyond bagging and random forests, our proposed method will be generally applicable to ensembles that can be represented in the form (1.2), such as those in Ho (1998); Dietterich (2000); Bühlmann and Yu (2002). This representation should be viewed abstractly, and it is not necessary for the function g or the objects ξ_i to be explicitly constructed in practice. Some further examples include a recent ensemble method based on random projections (Cannings and Samworth, 2017), as well as the voting Gibbs classifier (Ng and Jordan, 2001), which is a Bayesian ensemble method based on posterior sampling. More generally, if the classifiers Q_1, Q_2, … are i.i.d. conditionally on D, then the ensemble can be represented in the form (1.2), as long as the classifiers lie in a standard Borel space (Kallenberg, 2006, Lemma 3.22). Lastly, it is important to note that the representation (1.2) generally does not hold for classifiers generated by boosting methods (Schapire and Freund, 2012), for which the analysis of algorithmic convergence is quite different.
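To make the representation (1.2) concrete, the following sketch shows how a bagged ensemble of decision trees fits this form, with the randomizing object ξ_i taken to be the vector of bootstrap indices. This is only an illustrative rendering in Python; the use of scikit-learn's DecisionTreeClassifier (rather than the paper's R-based experiments) and all variable names are our own choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_bagged_ensemble(X_train, y_train, t, seed=0):
    """Train a bagged ensemble of t trees; X_train, y_train are numpy arrays.
    Here xi_i is the vector of bootstrap indices, and Q_i(x) = g(x, D, xi_i) amounts to
    'train a tree on the resampled data, then predict at x'."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    ensemble = []
    for _ in range(t):
        xi = rng.integers(0, n, size=n)  # randomizing parameter xi_i (bootstrap indices)
        tree = DecisionTreeClassifier(random_state=0).fit(X_train[xi], y_train[xi])
        ensemble.append(tree)
    return ensemble

def ensemble_predictions(ensemble, X_new):
    """Return a (t x m) array whose ith row contains Q_i's predicted labels on X_new."""
    return np.vstack([tree.predict(X_new) for tree in ensemble])
```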

Plurality vote

For any x ∈ 𝒳, we define the ensemble's plurality vote as the label receiving the largest number of votes among Q_1(x), …, Q_t(x). In the exceptional case of a tie, it will simplify technical matters to define the plurality vote as a symbol ⋆ not contained in 𝒴, so that a tie always counts as an error. We also use the labeling scheme 𝒴 = {e_1, …, e_k}, where e_j is the jth standard basis vector for R^k. One benefit of this scheme is that the plurality vote is determined by the average of the labels, Q̄_t(x) := (1/t) Σ_{i=1}^t Q_i(x). For this reason, we denote the plurality vote as V(Q̄_t(x)).
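As a small illustration of this labeling scheme (our own sketch in Python/numpy, not code from the paper), the plurality vote can be read off from the averaged one-hot labels, with ties mapped to a symbol outside the label set so that they count as errors.

```python
import numpy as np

def plurality_vote(Q_bar):
    """Q_bar: length-k average of one-hot labels (i.e. the vote proportions).
    Returns the winning class index, or -1 (a symbol outside {e_1,...,e_k}) on ties."""
    winners = np.flatnonzero(Q_bar == Q_bar.max())
    return int(winners[0]) if len(winners) == 1 else -1

# Example with k = 3 classes and t = 5 classifiers casting votes (2, 2, 1):
labels = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]])
Q_bar = labels.mean(axis=0)      # average of the one-hot labels
print(plurality_vote(Q_bar))     # -1: classes 0 and 1 tie, so the vote counts as an error
```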

Error rate

Let P denote the distribution of a test point (X, Y) in 𝒳 × 𝒴, drawn independently of D and ξ_1, …, ξ_t. Then, for a particular realization of the classifiers Q_1, …, Q_t, trained with the given set D, the prediction error rate is defined as

Err_t := ∫ 1{ V(Q̄_t(x)) ≠ y } dP(x, y),    (1.3)

where 1{·} denotes the indicator function. (Class-wise error rates Err_{t,ℓ}, with ℓ ∈ {1, …, k}, will also be addressed in Section 4.1.) Here, it is crucial to note that Err_t is a random variable, since Q̄_t is a random function. Indeed, the integral above shows that Err_t is a functional of Q̄_t. Moreover, there are two sources of randomness to consider: the algorithmic randomness arising from ξ_1, …, ξ_t, and the randomness arising from the training set D. Going forward, we will focus on the algorithmic fluctuations of Err_t due to ξ_1, …, ξ_t, and our analysis will always be conditional on D.

Algorithmic variance

At first sight, it might not be obvious how to interpret the algorithmic fluctuations of Err_t when D is held fixed. These fluctuations are illustrated below in Figure 1. The left panel shows how Err_t changes as decision trees are added incrementally during a single run of the random forests algorithm. For the purposes of illustration, if we run the algorithm repeatedly on D to train many ensembles, we obtain a large number of sample paths of Err_t as a function of t, shown in the right panel. Averaging the sample paths at each value of t produces the red curve, representing E[Err_t | D] with the algorithmic randomness averaged out. Furthermore, the blue envelope curves for the sample paths are obtained by plotting a fixed multiple of σ_t above and below the red curve, as a function of t.

Figure 1: Left panel: The fluctuations of Err_t for a single run of random forests on the "nursery data" (cf. Section 5). Right panel: The fluctuations of Err_t for 1,000 runs of random forests on the same data (i.e. different realizations of the ensemble Q_1, …, Q_t). For each ensemble, the value of Err_t was approximated by testing on the set of points denoted D'', as described in Section 5.
Problem formulation

Recall that the value Err_∞ represents the ideal prediction error of an infinite ensemble trained on D. Hence, a natural way of defining algorithmic convergence is to say that it occurs when t is large enough so that the condition |Err_t − Err_∞| ≤ ε holds with high probability, conditionally on D, for some user-specified tolerance ε > 0. However, the immediate problem we face is that it is not obvious how to check such a condition in practice.

From the right panel of Figure 1, we see that for most t, an inequality of the form |Err_t − Err_∞| ≤ c·σ_t, for a modest constant c, is highly likely to hold, and this observation can be formalized using Theorem 3.1 later on. For this reason, we propose to estimate σ_t as a route to measuring algorithmic convergence. It is also important to note that estimating the quantiles of Err_t − Err_∞ would serve the same purpose, but for the sake of simplicity, we will focus on σ_t. In particular, there are at least two ways that an estimate σ̂_t can be used in practice:

  1. Checking convergence for a given ensemble. If an ensemble of a given size t has been trained, then convergence can be checked by asking whether or not the observable condition σ̂_t ≤ ε holds. Additional comments on possible choices for ε will be given shortly.

  2. Selecting t dynamically. In order to make the training process as computationally efficient as possible, it is desirable to select the smallest t needed so that σ_t ≤ ε is likely to hold. It turns out that this can be accomplished using an extrapolation technique, due to the fact that σ_t tends to scale like 1/√t (cf. Theorem 3.1). More specifically, if the user trains a small initial ensemble of size t_0 and computes an estimate σ̂_{t_0}, then "future" values of σ_t for t > t_0 can be estimated at no additional cost with the re-scaled estimate √(t_0/t)·σ̂_{t_0}. In other words, it is possible to look ahead and predict how many additional classifiers are needed to achieve σ_t ≤ ε. Additional details are given in Section 4.2.

Sources of difficulty in estimating σ_t

Having described the basic formulation of the problem, it is important to identify what challenges are involved in estimating σ_t. First, we must keep in mind that the parameter σ_t describes how Err_t fluctuates over repeated ensembles generated from D, and so it is not obvious that σ_t can be estimated from the output of a single ensemble. Second, the computational cost of estimating σ_t should not outweigh the cost of training the ensemble, and consequently, the proposed method should be computationally efficient. These two obstacles will be described in Sections 3.2 and 4.2 respectively.

Remarks on the choice of error rate

Instead of analyzing the random variable Err_t, a natural inclination would be to consider the "unconditional" error rate E[Err_t], which is likely more familiar, and reflects averaging over both D and the randomized algorithm. Nevertheless, there are a few reasons why the "conditional" error Err_t may be more suitable for analyzing algorithmic convergence. First, the notion of algorithmic convergence typically describes how much computation should be applied to a given input, and in our context, the given input for the training algorithm is D. Second, from an operational standpoint, once a user has trained an ensemble on a given dataset, their actual probability of misclassifying a future test point is Err_t, rather than E[Err_t].

There is also a sense in which the convergence of E[Err_t] may be misleading. Existing theoretical results suggest that E[Err_t] converges at the fast rate of 1/t, and in the binary case, k = 2, it can be shown that the bias is of order c/t, for some number c, under certain conditions (Lopes, 2016; Cannings and Samworth, 2017). However, in Theorem 3.1 of Section 3, we show that conditionally on D, the difference Err_t − Err_∞ has fluctuations of order 1/√t. In this sense, if the choice of t is guided only by E[Err_t] (rather than the fluctuations of Err_t), then the user may be misled into thinking that algorithmic convergence occurs much faster than it really does, and this distinction is apparent from the red and blue curves in Figure 1.

Outline

Our proposed bootstrap method is described in Section 2, and our main consistency result is given in Section 3. Practical considerations are discussed in Section 4, numerical experiments are given in Section 5, and conclusions are stated in Section 6. The essential ideas of the proofs are explained in Appendices A and B, while the technical arguments are given in Appendices C-E. Lastly, in Appendix F, we provide additional assessment of technical assumptions. All appendices are in the supplementary material.

2 Method

Based on the definition of Err_t in equation (1.3), we may view Err_t as a functional of the averaged labels Q̄_t, denoted

Err_t = ψ(Q̄_t).

From a statistical standpoint, the importance of this expression is that Err_t is a functional of a sample mean, which makes it plausible that Err_t is amenable to bootstrapping, provided that ψ is sufficiently smooth.

To describe the bootstrap method, let Q_1*, …, Q_t* denote a random sample with replacement from the trained ensemble Q_1, …, Q_t, and put Q̄_t*(x) := (1/t) Σ_{i=1}^t Q_i*(x). In turn, it would be natural to regard the quantity

Err_t* := ψ(Q̄_t*)    (2.1)

as a bootstrap sample of Err_t, but strictly speaking, this is an "idealized" bootstrap sample, because the functional ψ depends on the unknown test point distribution P. Likewise, in Section 2.1 below, we explain how each such value can be estimated. So, in other words, if ψ̂ denotes an estimate of the functional ψ, then an estimate of Err_t would be written as Êrr_t := ψ̂(Q̄_t), and the corresponding bootstrap sample is Êrr_t* := ψ̂(Q̄_t*). Altogether, a basic version of the proposed bootstrap algorithm is summarized as follows.

Algorithm 1 (Bootstrap estimation of σ_t)

For b = 1, …, B:

  • Sample t classifiers Q_1*, …, Q_t* with replacement from Q_1, …, Q_t.

  • Compute the bootstrap sample Êrr_t*(b) := ψ̂(Q̄_t*), where Q̄_t* is the average of the resampled labels.

Return: the sample standard deviation of Êrr_t*(1), …, Êrr_t*(B), denoted σ̂_t.

Remark

While the above algorithm is conceptually simple, it suppresses most of the implementation details, and these are explained below. Also note that in order to approximate quantiles of Err_t − Err_∞, rather than σ_t, it is only necessary to modify the last step, by returning the desired quantile of the centered values Êrr_t*(b) − Ē*, with b = 1, …, B, where Ē* denotes the average of the bootstrap samples.

2.1 Resampling algorithm with hold-out or “out-of-bag” points

Here, we consider a version of Algorithm 1 where the test point distribution P is estimated implicitly with hold-out points, and then later on, we will explain how hold-out points can be avoided using "out-of-bag" (oob) points. To begin, suppose we have a hold-out set of size m, denoted

D' := {(X̃_1, Ỹ_1), …, (X̃_m, Ỹ_m)}.    (2.2)

Next, consider an array A of size t × m, whose ith row is given by the predicted labels of Q_i on the hold-out points. That is,

A_i := (Q_i(X̃_1), …, Q_i(X̃_m)),  i = 1, …, t.    (2.3)

The estimated error rate is easily computed as a function of this array, i.e. Êrr_t = Êrr_t(A). To spell out the details, let A(j) denote the jth column of A, and with a slight abuse of our earlier notation, let V(A(j)) denote the plurality vote of the labels in A(j). Then, the estimated error rate is obtained from simple column-wise operations on A,

Êrr_t := (1/m) Σ_{j=1}^m 1{ V(A(j)) ≠ Ỹ_j }.    (2.4)

In other words, Êrr_t is just the proportion of columns of A for which the plurality vote is incorrect. (Note that equation (2.4) is where the test point distribution P is implicitly estimated.) Finally, since there is a one-to-one correspondence between the rows A_i and the classifiers Q_i, the proposed method is equivalent to resampling the rows of A, as given below.

Algorithm 2 (Bootstrap estimation of σ_t with hold-out points)

For b = 1, …, B:

  • Draw a t × m array A* whose rows are sampled with replacement from the rows of A.

  • Compute the bootstrap sample Êrr_t*(b) := Êrr_t(A*), as in equation (2.4).

Return: the sample standard deviation of Êrr_t*(1), …, Êrr_t*(B), denoted σ̂_t.

Extension to OOB points

Since the use of a hold-out set is often undesirable in practice, we instead consider oob points, which are a special feature of bagging and random forests. To briefly review this notion, recall that each classifier Q_i is trained on a set D_i of points obtained by sampling with replacement from D. Consequently, each set D_i excludes approximately 37% of the points in D, and these excluded points may be used as test points for the particular classifier Q_i. If a point X_j ∈ D is excluded from D_i, then we say "the point X_j is oob for the classifier Q_i", and we write i ∈ I_j, where the set I_j indexes the classifiers for which X_j is oob.

In this notation, the error estimate Êrr_t in Algorithm 2 can be given an analogous definition in terms of oob points. Define a new t × n array à whose ith row is given by

Ã_i := (Q_i(X_1), …, Q_i(X_n)).

Next, letting Ã(j) be the jth column of Ã, define Ṽ_j to be the plurality vote of the set of labels {Ã_{ij} : i ∈ I_j}. (If this set of labels is empty, then we treat this case as a tie, but this is unimportant, since it occurs with probability tending to 0 rapidly as t grows.) So, by analogy with equation (2.4), we define

Êrr_t^{oob} := (1/n) Σ_{j=1}^n 1{ Ṽ_j ≠ Y_j }.    (2.5)

Hence, the oob version of Algorithm 2 may be implemented by simply interchanging A and Ã, as well as Êrr_t and Êrr_t^{oob}. The essential point to notice is that the sum in equation (2.5) is now over the training points in D, rather than over the hold-out set D', as in equation (2.4).
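As a concrete illustration of Section 2, the following is a minimal numpy sketch (our own, not the authors' implementation) of the hold-out version of Algorithm 2, operating directly on the t × m array of predicted labels; the oob version would only change which entries of each column are allowed to vote, following equation (2.5). The default value of B is a placeholder, not a recommendation from the paper.

```python
import numpy as np

def column_plurality(A, k):
    """Plurality vote of each column of a (t x m) array of labels in {0,...,k-1}.
    Ties are returned as -1, so they always count as errors."""
    counts = np.stack([(A == c).sum(axis=0) for c in range(k)])   # (k x m) vote counts
    top = counts.max(axis=0)
    votes = counts.argmax(axis=0)
    votes[(counts == top).sum(axis=0) > 1] = -1                   # mark ties
    return votes

def bootstrap_sigma(A, y_holdout, k, B=500, seed=0):
    """Hold-out version of Algorithm 2: resample rows of A (i.e. classifiers) with
    replacement, recompute the estimated error rate as in (2.4), and return the
    sample standard deviation of the B bootstrap replicates."""
    rng = np.random.default_rng(seed)
    t = A.shape[0]
    errs = np.empty(B)
    for b in range(B):
        rows = rng.integers(0, t, size=t)          # sample t classifiers with replacement
        votes = column_plurality(A[rows], k)
        errs[b] = np.mean(votes != y_holdout)      # proportion of misvoted hold-out points
    return errs.std(ddof=1)
```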

3 Main result

Our main theoretical goal is to prove that the bootstrap yields a consistent approximation of the distribution of Err_t − Err_∞ as t becomes large. Toward this goal, we will rely on two simplifications that are customary in analyses of bootstrap and ensemble methods. First, we will exclude the Monte Carlo error arising from the finite number of bootstrap replicates, as well as the error arising from the estimation of the test point distribution P. For this reason, our results do not formally require the training or hold-out points to be i.i.d. copies of the test point, but from a practical standpoint, it is natural to expect that this type of condition should hold in order for Algorithm 2 (or its oob version) to work well.

Second, we will analyze a simplified type of ensemble, which we will refer to as a first-order model. This type of approach has been useful in gaining theoretical insights into the behavior of complex ensemble methods in a variety of previous works (Biau, Devroye and Lugosi, 2008; Biau, 2012; Arlot and Genuer, 2014; Lin and Jeon, 2006; Genuer, 2012; Scornet, 2016a, b). In our context, the value of this simplification is that it neatly packages the complexity of the base classifiers, and clarifies the relationship between the underlying classifiers and the quality of the bootstrap approximation. Also, even with such simplifications, the theoretical problem of proving bootstrap consistency still leads to considerable technical challenges. Lastly, it is important to clarify that the first-order model is introduced only for theoretical analysis, and our proposed method does not rely on this model.

3.1 A first-order model for randomized ensembles

Any randomized classifier Q(x) may be viewed as a stochastic process indexed by x ∈ 𝒳. From this viewpoint, we say that another randomized classifier Q̃(x) is a first-order model for Q(x) if it has the same marginal distributions as Q(x), conditionally on D, which means

Q̃(x) | D  =  Q(x) | D  in distribution, for each fixed x ∈ 𝒳.    (3.1)

Since Q(x) takes values in the finite set of binary vectors {e_1, …, e_k}, the condition (3.1) is equivalent to

E[Q̃(x) | D] = E[Q(x) | D]  for all x ∈ 𝒳,    (3.2)

where the expectation is only over the algorithmic randomness in Q̃(x) and Q(x). A notable consequence of this matching condition is that the ensembles associated with Q̃ and Q have the same error rates on average. Indeed, if we let Err_t(Q̃) be the error rate associated with an ensemble of t independent copies of Q̃, then it turns out that

E[ Err_t(Q̃) | D ] = E[ Err_t | D ]    (3.3)

for all t ≥ 1, where Err_t is the error rate for the ensemble Q_1, …, Q_t, as before. (A short proof is given in Appendix E.) In this sense, a first-order model is a meaningful proxy for the original ensemble with regard to statistical performance, even though the internal mechanisms of Q̃ may be simpler.

3.1.1 Constructing a first-order model

Having stated some basic properties that are satisfied by any first-order model, we now construct a particular version that is amenable to analysis. Interestingly, it is possible to start with an arbitrary random classifier Q(x), and construct an associated first-order model Q̃(x) in a relatively explicit way.

To do this, let x ∈ 𝒳 be fixed, and consider the function

μ(x) := ( P(Q(x) = e_1 | D), …, P(Q(x) = e_{k−1} | D) ),    (3.4)

which takes values in the "full-dimensional" simplex Δ, defined by

Δ := { v ∈ [0,1]^{k−1} : v_1 + ⋯ + v_{k−1} ≤ 1 }.

For any fixed x, there is an associated partition of the unit interval [0,1] into sub-intervals

I_1(x), …, I_k(x),

such that the width of interval I_ℓ(x) is equal to P(Q(x) = e_ℓ | D) for ℓ = 1, …, k. Namely, we put I_1(x) := [0, μ_1(x)], and for ℓ = 2, …, k−1,

I_ℓ(x) := ( μ_1(x) + ⋯ + μ_{ℓ−1}(x),  μ_1(x) + ⋯ + μ_ℓ(x) ].

Lastly, for ℓ = k, we put

I_k(x) := ( μ_1(x) + ⋯ + μ_{k−1}(x),  1 ].

Now, if we let x ∈ 𝒳 be fixed, and let U ∼ Uniform[0,1], then we define Q̃(x) to have its ℓth coordinate equal to the following indicator variable

1{ U ∈ I_ℓ(x) },

where ℓ ∈ {1, …, k}. It is simple to check that the first-order matching condition (3.2) holds, and so Q̃(x) is indeed a first-order model of Q(x). Furthermore, given that Q̃(x) is defined in terms of a single random variable U ∼ Uniform[0,1], we obtain a corresponding "first-order ensemble" Q̃_1, …, Q̃_t via an i.i.d. sample of uniform variables U_1, …, U_t, which are independent of D. (The ℓth coordinate of the ith classifier at x is given by 1{U_i ∈ I_ℓ(x)}.) Hence, with regard to the representation in equation (1.2), we may make the identification

ξ_i = U_i

when the first-order model holds, with Q_i(x) = Q̃_i(x).
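In code, the construction above simply draws the first-order label from the categorical distribution determined by the vote probabilities at x. The sketch below (our own, in Python; the probability vector is a hypothetical example) makes the role of the single uniform variable explicit.

```python
import numpy as np

def first_order_label(p, U):
    """p: vote probabilities (P(Q(x) = e_1 | D), ..., P(Q(x) = e_k | D)) at a fixed x.
    U: a Uniform(0,1) draw. Returns the one-hot label of the sub-interval I_l(x) containing U."""
    edges = np.cumsum(p)                                  # right endpoints of I_1(x), ..., I_k(x)
    ell = int(np.searchsorted(edges, U, side="left"))     # index of the interval containing U
    label = np.zeros(len(p))
    label[ell] = 1.0
    return label

# A first-order ensemble of size t = 5 at the point x, driven by i.i.d. uniforms U_1, ..., U_5:
rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])                             # hypothetical values of P(Q(x) = e_l | D)
ensemble_labels = [first_order_label(p, rng.uniform()) for _ in range(5)]
```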

Remark

To mention a couple of clarifications, the variables U_1, …, U_t are only used for the construction of a first-order model, and they play no role in the proposed method. Also, even though the "randomizing parameters" U_1, …, U_t are independent of D, the classifiers Q̃_1, …, Q̃_t still depend on D through the function μ.

Interpretation of first-order model

To understand the statistical meaning of the first-order model, it is instructive to consider the simplest case of binary classification, k = 2. In this case, the first coordinate of Q̃(x) is a Bernoulli random variable, with success probability μ_1(x) = P(Q(x) = e_1 | D). Since Q̄_t(x) → E[Q(x) | D] almost surely as t → ∞ (conditionally on D), the majority vote of an infinite ensemble has a similar form, i.e. it votes for class 1 when μ_1(x) > 1/2. Hence, the classifiers Q̃_1, …, Q̃_t can be viewed as "random perturbations" of the asymptotic majority vote arising from Q. Furthermore, if we view the number μ_1(x) as a score to be compared with a threshold, then the variable U_i plays the role of a random threshold whose expected value is 1/2. Lastly, even though the formula 1{μ_1(x) ≥ U_i} might seem to yield a simplistic classifier, the complexity of Q is actually wrapped up in the function μ. Indeed, the matching condition (3.2) allows for the function μ to be arbitrary.

3.2 Bootstrap consistency

We now state our main result, which asserts that the bootstrap “works” under the first-order model. To give meaning to bootstrap consistency, we first review the notion of conditional weak convergence.

Conditional weak convergence

Let L_0 be a probability distribution on R, and let {L_t} be a sequence of random probability distributions on R that depend on the randomizing parameters ξ_1, ξ_2, …. Also, let d_BL be the bounded Lipschitz metric for distributions on R (van der Vaart and Wellner, 1996, Sec 1.12), and let P_ξ be the joint distribution of (ξ_1, ξ_2, …). Then, as t → ∞, we say that L_t converges weakly to L_0 in P_ξ-probability if the sequence d_BL(L_t, L_0) converges to 0 in P_ξ-probability.

Remark

If a test point X is drawn from class ℓ, then we denote the distribution of the random vector μ(X), conditionally on D, as ν_ℓ, which is a distribution on the simplex Δ. Since this distribution plays an important role in our analysis, it is worth noting that the properties of ν_ℓ are not affected by the assumption of a first-order model, since E[Q̃(x) | D] = E[Q(x) | D] for all x ∈ 𝒳. We will also assume that the measures ν_1, …, ν_k satisfy the following extra regularity condition.

Assumption 1.

For the given set D, and each ℓ ∈ {1, …, k}, the distribution ν_ℓ has a density f_ℓ with respect to Lebesgue measure on Δ, and f_ℓ is continuous on Δ. Also, if int(Δ) denotes the interior of Δ, then for each ℓ, the density f_ℓ is continuously differentiable on int(Δ), and its gradient is bounded on int(Δ).

To interpret this assumption, consider a situation where the class-wise test point distributions have smooth densities with respect to Lebesgue measure. In this case, the density f_ℓ will exist as long as the function μ is sufficiently smooth (cf. Appendix F, Proposition F.1). Still, Assumption 1 might seem unrealistic in the context of random forests, because μ is obtained by averaging over all decision trees that can be generated from D, and strictly speaking, this is a finite average of non-smooth functions. However, due to the bagging mechanism in random forests, the space of trees that can be generated from D is very large, and consequently, the function μ represents a very fine-grained average. Indeed, the idea that bagging is actually a "smoothing operation" on non-smooth functions has received growing attention over the years (Buja and Stuetzle, 2000; Bühlmann and Yu, 2002; Buja and Stuetzle, 2006; Efron, 2014), and the recent paper Efron (2014) states that bagging is "also known as bootstrap smoothing". In Appendix F, we provide additional assessment of Assumption 1, in terms of both theoretical and empirical examples.

Theorem 3.1 (Bootstrap consistency).

Suppose that the first-order model holds for all t ≥ 1, and that Assumption 1 holds. Then, for the given set D, there are numbers Err_∞ ∈ [0,1] and σ ≥ 0, depending on D, such that as t → ∞,

√t · (Err_t − Err_∞) converges weakly to N(0, σ²),    (3.5)

and furthermore, the distribution of the bootstrap samples √t · (Err_t* − Err_t), conditionally on the ensemble, converges weakly to the same N(0, σ²) limit in P_ξ-probability.

Remarks

In a nutshell, the proof of Theorem 3.1 is composed of three pieces: showing that Err_t can be represented as a functional of an empirical process (Appendix A.1), establishing the smoothness of this functional (Appendix A.2), and employing the functional delta method (Appendix A.3). With regard to theoretical techniques, there are two novel aspects of the proof. The problem of deriving this functional is solved by introducing a certain lifting operator, while the problem of showing smoothness is based on a non-smooth instance of the first-variation formula, as well as some special properties of Bernstein polynomials. Lastly, it is worth mentioning that the core technical result of the paper is Theorem A.1.

To mention some consequences of Theorem 3.1, the fact that the limiting distribution of √t · (Err_t − Err_∞) has mean 0 shows that the fluctuations of Err_t have more influence on algorithmic convergence than the bias E[Err_t] − Err_∞ (as illustrated in Figure 1). Second, the limiting distribution motivates a convergence criterion of the form σ̂_t ≤ ε, since asymptotic normality indicates that Err_t should lie within a few multiples of σ_t of Err_∞ with high probability when t is large, and again, this is apparent in Figure 1. Lastly, the theorem implies that the quantiles of Err_t − Err_∞ agree asymptotically with their bootstrap counterparts. This is of interest, because quantiles allow the user to specify a bound on |Err_t − Err_∞| that holds with a tunable probability. Quantiles also provide an alternative route to estimating algorithmic variance, because if q_t denotes the interquartile range of Err_t − Err_∞, then the theorem implies σ_t ≈ q_t/c, where c = 2·Φ^{-1}(3/4) ≈ 1.35.
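To illustrate the last remark, the quantile route can be carried out on the same bootstrap replicates produced by Algorithm 2; the sketch below (our own, assuming boot_errs holds the replicates Êrr_t*(1), …, Êrr_t*(B)) converts their interquartile range into an estimate of σ_t.

```python
import numpy as np
from scipy.stats import norm

def sigma_from_iqr(boot_errs):
    """Estimate sigma_t from the interquartile range of the bootstrap replicates,
    using the normal-theory constant 2 * Phi^{-1}(3/4) ~ 1.35."""
    q25, q75 = np.percentile(boot_errs, [25, 75])
    return (q75 - q25) / (2 * norm.ppf(0.75))
```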

4 Practical considerations

In this section, we discuss some considerations that arise when the proposed method is used in practice, such as the choice of error rate, the computational cost, and the choice of a stopping criterion for algorithmic convergence.

4.1 Extension to class-wise error rates

In some applications, class-wise error rates may be of greater interest than the total error rate Err_t. For any ℓ ∈ {1, …, k}, let P_ℓ denote the distribution of the test point X given that it is drawn from class ℓ. Then, the error rate on class ℓ is defined as

Err_{t,ℓ} := ∫ 1{ V(Q̄_t(x)) ≠ e_ℓ } dP_ℓ(x),    (4.1)

and the corresponding algorithmic variance is

σ_{t,ℓ}² := var(Err_{t,ℓ} | D).

In order to estimate σ_{t,ℓ}, Algorithm 2 can be easily adapted using either hold-out or oob points from a particular class, as sketched below. Our theoretical analysis also extends immediately to the estimation of σ_{t,ℓ} (cf. Section A.1).
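For example, with the hold-out version, estimating σ_{t,ℓ} only requires restricting the columns of A to the hold-out points from class ℓ before running Algorithm 2; a minimal sketch (our own, reusing the bootstrap_sigma function sketched in Section 2.1):

```python
def bootstrap_sigma_classwise(A, y_holdout, ell, k, B=500, seed=0):
    """Estimate sigma_{t,ell} by keeping only the hold-out points from class ell."""
    cols = (y_holdout == ell)
    return bootstrap_sigma(A[:, cols], y_holdout[cols], k, B=B, seed=seed)
```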

4.2 Computational cost and extrapolation

A basic observation about Algorithm 2 is that it only relies on the array A of predicted labels (or alternatively Ã). Consequently, the algorithm does not require any re-training of the classifiers. Also, with regard to computing the arrays A or Ã, at least one of these is typically computed anyway when evaluating an ensemble's performance with hold-out or oob points, and so the cost of obtaining A or à will typically not be an added expense of Algorithm 2. Thirdly, the algorithm can be parallelized, since the bootstrap replicates can be computed independently. Lastly, the cost of the algorithm is dimension-free with respect to the feature space 𝒳, since all operations are on the arrays A or Ã, whose sizes do not depend on the number of features.

To measure the cost of Algorithm 2 in terms of floating point operations, it is simple to check that at each iteration b = 1, …, B, the cost of evaluating Êrr_t*(b) is of order t·m, since A* has m columns, and each evaluation of the plurality voting function has cost of order t. Hence, if the arrays A or à are viewed as given, and if m ≤ n, then the cost of Algorithm 2 is of order B·t·n, for either the hold-out or oob versions. Below, we describe how this cost can be reduced using a basic form of extrapolation (Bickel and Yahav, 1988; Sidi, 2003; Brezinski and Zaglia, 2013).

Saving on computation with extrapolation

To explain the technique of extrapolation, the first step produces an inexpensive estimate σ̂_{t_0} by applying Algorithm 2 to a small initial ensemble of size t_0. The second step then rescales σ̂_{t_0} so that it approximates σ_t for t > t_0. This rescaling relies on Theorem 3.1, which leads to the approximation σ_t ≈ √(t_0/t)·σ_{t_0}. Consequently, we define the extrapolated estimate of σ_t as

σ̂_t := √(t_0/t)·σ̂_{t_0}.    (4.2)

In turn, if the user desires σ_t ≤ ε for some ε > 0, then t should be chosen so that

√(t_0/t)·σ̂_{t_0} ≤ ε,    (4.3)

which is equivalent to t ≥ t_0·(σ̂_{t_0}/ε)².

In addition to applying Algorithm 2 to a smaller ensemble, a second computational benefit is that extrapolation allows the user to "look ahead" and dynamically determine how much extra computation is needed so that σ_t is within a desired range; a code sketch of this workflow is given below. In Section 5, some examples are given showing that σ_t can be estimated well via extrapolation even when t is several times larger than t_0.
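Putting equations (4.2) and (4.3) together with Algorithm 2, a complete look-ahead workflow might be organized as in the sketch below (our own Python rendering, reusing the bootstrap_sigma function sketched in Section 2.1; t0, B, and eps are placeholders rather than values recommended by the paper).

```python
import math

def extrapolate_and_select(A0, y_holdout, k, t0, eps, B=500, seed=0):
    """A0: (t0 x m) array of predicted labels from an initial ensemble of size t0.
    Returns the initial estimate sigma_{t0} and the smallest t whose extrapolated
    estimate sqrt(t0/t) * sigma_{t0} is at most eps, i.e. t >= t0 * (sigma_{t0}/eps)^2."""
    sigma0 = bootstrap_sigma(A0, y_holdout, k, B=B, seed=seed)   # Algorithm 2 on the small ensemble
    t_needed = max(t0, math.ceil(t0 * (sigma0 / eps) ** 2))      # from the criterion (4.3)
    return sigma0, t_needed
```

In practice, the user would then continue growing the forest from t_0 up to the returned value of t and stop.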

Comparison with the cost of training a random forest

Given that one of the main uses of Algorithm 2 is to control the size of a random forest, one would hope that the cost of Algorithm 2 is less than or similar to the cost of training a single ensemble. In order to simplify this comparison, suppose that each tree in the ensemble is grown so that all nodes are split into exactly 2 child nodes (except for the leaves), and that all trees are grown to a common depth d. Furthermore, suppose that there are p features, and that random forests uses the default rule of randomly selecting from ⌊√p⌋ candidate features during each node split. Under these conditions, it is known that the cost of training a random forest with t trees via CART is at least of order t·√p·n·d (Breiman et al., 1984, p.166). Additional background on the details of random forests and decision trees may be found in the book Friedman, Hastie and Tibshirani (2001).

Based on the reasoning just given, the cost of running Algorithm 2 does not exceed the cost of training t trees, provided that

B·t_0·n ≤ t·√p·n·d,  or equivalently,  B ≤ (t/t_0)·√p·d,

where the factor t/t_0 arises from the extrapolation speedup described earlier. Moreover, with regard to the selection of B, our numerical examples in Section 5 show that a modest choice of B allows Algorithm 2 to perform well on a variety of datasets.

The choice of the threshold ε

When using a criterion such as (4.3) in practice, the choice of the threshold ε will typically be unique to the user's goals. For instance, if the user desires that Err_t is within 0.5% of Err_∞, then a choice on the order of ε = 0.5% would be appropriate. Another option is to choose ε from a relative standpoint, depending on the scale of the error. If the error is high (say Err_∞ is 40%), then it may not be worth paying a large computational price to ensure that σ_t is less than 0.5%. Conversely, if Err_∞ is 2%, then it may be worthwhile to train a very large ensemble so that σ_t is a fraction of 2%. In either of these cases, the size of σ_t could be controlled in a relative sense by selecting ε = α·Êrr_∞, where Êrr_∞ is an estimate of Err_∞ obtained from a hold-out set, and α is a user-specified constant that measures the balance between computational cost and accuracy. But regardless of the user's preference for ε, the more basic point to keep in mind is that the proposed method makes it possible for the user to have direct control over the relationship between t and σ_t, and this type of control has not previously been available.

5 Numerical Experiments

To illustrate our proposed method, we describe experiments in which the random forests method is applied to natural and synthetic datasets (6 in total). More specifically, we consider the task of estimating the parameter σ_t from Section 3, as well as its class-wise analogue σ_{t,ℓ}. Overall, the main purpose of the experiments is to show that the bootstrap can indeed produce accurate estimates of these parameters. A second purpose is to demonstrate the value of the extrapolation technique from Section 4.2.

5.1 Design of experiments

Each of the 6 datasets was partitioned in the following way. First, each dataset was evenly split into a training set D and a "ground truth" set D'', with nearly matching class proportions in D and D''. (The reason that a substantial portion of data was set aside for D'' was to ensure that ground truth values of σ_t and σ_{t,ℓ} could be approximated using this set.) Next, a smaller set D', with cardinality much smaller than that of D, was used as the hold-out set for implementing Algorithm 2. As before, the class proportions in D and D' were nearly matching. The smaller size of D' was chosen to illustrate the performance of the method when hold-out points are limited.

Ground truth values

After preparing D, D', and D'', a collection of 1,000 ensembles was trained on D by repeatedly running the random forests method. Each ensemble contained a total of 1,000 trees, trained under default settings from the package randomForest (Liaw and Wiener, 2002). Also, we tested each ensemble on D'' to approximate a corresponding sample path of Err_t (like the ones shown in Figure 1). Next, in order to obtain "ground truth" values for σ_t with 1 ≤ t ≤ 1,000, we used the sample standard deviation of the 1,000 sample paths at each t. (Ground truth values for each σ_{t,ℓ} were obtained analogously.)
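As a sketch of this ground-truth computation (our own numpy rendering), if err_paths is an array with one row per run and one column per ensemble size, recording the error on the ground-truth set, then the ground-truth values of σ_t are just column-wise sample standard deviations:

```python
import numpy as np

def ground_truth_sigma(err_paths):
    """err_paths: (n_runs x t_max) array of sample paths of Err_t, one row per run.
    Returns the sample standard deviation across runs at each ensemble size t."""
    return err_paths.std(axis=0, ddof=1)
```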

Extrapolated estimates

With regard to our methodology, we applied the hold-out and oob versions of Algorithm 2 to each of the ensembles, yielding 1,000 realizations of each type of estimate of σ_{t_0}. In each case, a fixed number B of bootstrap replicates was used, and we applied the extrapolation rule, starting from an initial ensemble of size t_0. If we let σ̂_{t_0,hold} and σ̂_{t_0,oob} denote the initial hold-out and oob estimators, then the corresponding extrapolated estimators for t ≥ t_0 are given by

σ̂_{t,hold} := √(t_0/t)·σ̂_{t_0,hold}   and   σ̂_{t,oob} := √(t_0/t)·σ̂_{t_0,oob}.    (5.1)

Next, as a benchmark, we considered an enhanced version of the hold-out estimator, for which the entire ground truth set D'' was used in place of D'. In other words, this benchmark reflects a situation where a much larger hold-out set is available, and it is referred to as the "ground estimate" in the plots. Its value based on the initial ensemble of size t_0 is denoted σ̂_{t_0,gr}, and for t ≥ t_0, we use

σ̂_{t,gr} := √(t_0/t)·σ̂_{t_0,gr}    (5.2)

to refer to its extrapolated version. Lastly, class-wise versions of all extrapolated estimators were computed in an analogous way.

5.2 Description of datasets

The following datasets were each partitioned into D, D', and D'', as described above.

Census income data

A set of census records for 48,842 people, each with 14 socioeconomic features (continuous and discrete), was collected (Lichman, 2013). Each record was labeled as 0 or 1, corresponding to low or high income. The proportions of the classes are approximately (.76, .24). As a pre-processing step, we excluded three features corresponding to work-class, occupation, and native country, due to a high proportion of missing values.

Connect-4 data

The observations represent 67,557 board positions in the two-person game "connect-4" (Lichman, 2013). For each position, a list of 42 categorical features is available, and each position is labeled as a draw, loss, or win for the first player, with noticeably unbalanced class proportions.

Nursery data

This dataset was prepared from a set of 12,960 applications for admission to a nursery school (Lichman, 2013). Each application was associated with a list of 8 (categorical) socioeconomic features. Originally, each application was labeled as one of five classes, but in order to achieve reasonable label balance, the last three categories were combined. This led to three classes with roughly balanced proportions.

Online news data

A collection of 39,797 news articles from the website mashable.com was associated with 60 features (continuous and discrete). Each article was labeled based on the number of times it was shared: fewer than 1,000 shares, between 1,000 and 5,000 shares, and greater than 5,000 shares.

Synthetic continuous data

Two classes of data points, each of size 10,000, were obtained by drawing samples from two multivariate normal distributions with mean vectors μ_0 and μ_1 and a common covariance matrix Σ. The first mean vector μ_0 was chosen to be the zero vector, and the second mean vector μ_1 was constructed to be a sparse vector in the following way. Specifically, we sampled 10 coordinate indices without replacement, and the coordinates of μ_1 indexed by these numbers were set to a common nonzero value (with all other coordinates set to 0). Letting Σ = UΛU^T denote the spectral decomposition of Σ, we selected the matrix of eigenvectors U by sampling from the uniform (Haar) distribution on orthogonal matrices. The eigenvalues in Λ were set to a fixed sequence of positive values.
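A generation sketch for this type of synthetic data is given below (our own Python code); the dimension, the nonzero mean value, and the eigenvalue sequence are placeholders, since the paper's exact settings are not reproduced here.

```python
import numpy as np

def make_synthetic_continuous(n_per_class=10000, p=50, n_sparse=10, mean_val=1.0, seed=0):
    """Two Gaussian classes with a shared covariance whose eigenvectors are Haar-distributed."""
    rng = np.random.default_rng(seed)
    # Haar-distributed orthogonal eigenvector matrix via QR of a Gaussian matrix.
    Z = rng.standard_normal((p, p))
    U, R = np.linalg.qr(Z)
    U = U * np.sign(np.diag(R))                  # sign correction so the distribution is Haar
    eigvals = np.linspace(1.0, 0.1, p)           # placeholder decreasing spectrum
    Sigma = (U * eigvals) @ U.T
    # Sparse mean shift for the second class: n_sparse randomly chosen coordinates are nonzero.
    mu1 = np.zeros(p)
    mu1[rng.choice(p, size=n_sparse, replace=False)] = mean_val
    X0 = rng.multivariate_normal(np.zeros(p), Sigma, size=n_per_class)
    X1 = rng.multivariate_normal(mu1, Sigma, size=n_per_class)
    X = np.vstack([X0, X1])
    y = np.r_[np.zeros(n_per_class, dtype=int), np.ones(n_per_class, dtype=int)]
    return X, y
```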

Synthetic discrete data

Two classes of data points in a 100-dimensional feature space, each of size 10,000, were obtained by drawing samples from the discrete distributions Multinomial(N, q_0) and Multinomial(N, q_1), where N refers to the number of balls placed in 100 cells, and the cell probabilities are specified by the vectors q_0 and q_1. The vector q_1 was obtained by perturbing q_0 and then normalizing it. Namely, letting z be a vector of i.i.d. random variables, we defined q_1 by adding z to q_0, taking coordinate-wise absolute values, and normalizing the result to sum to 1.

5.3 Numerical results

Interpreting the plots

For each dataset, we plot the ground truth value