MP-Boost: Minipatch Boosting via Adaptive Feature and Observation Sampling

11/14/2020 · Mohammad Taha Toghani, et al. · Rice University

Boosting methods are among the best general-purpose, off-the-shelf machine learning approaches and have gained widespread popularity. In this paper, we seek to develop a boosting method that yields accuracy comparable to the popular AdaBoost and gradient boosting methods, yet is computationally faster and whose solution is more interpretable. We achieve this by developing MP-Boost, an algorithm loosely based on AdaBoost that learns by adaptively selecting small subsets of instances and features, or what we term minipatches (MP), at each iteration. By sequentially learning on tiny subsets of the data, our approach is computationally faster than other classic boosting algorithms. Additionally, as it progresses, MP-Boost adaptively learns probability distributions on the features and instances that upweight the most important features and the most challenging instances, hence adaptively selecting the most relevant minipatches for learning. These learned probability distributions also aid in the interpretation of our method. We empirically demonstrate the interpretability, comparative accuracy, and computational time of our approach on a variety of binary classification tasks.


1 Introduction

Boosting algorithms adaptively learn a series of weak learners that together often yield state-of-the-art predictive accuracy, but they are computationally slow. Huge datasets with many instances and features (e.g., data recorded from sensors, texts, and images) highlight this "slow learning" behavior. For example, AdaBoost [freund1997decision] and gradient boosting [friedman2001greedy] suffer from slow training because they aim for strong performance regardless of the size of the data. Further, most boosting methods are "black-box", meaning that their interpretation [lipton2018mythos] lacks transparency and simplicity. Identifying the features and examples with the most impact helps to obtain a more interpretable model [biau2016random, benard2020interpretable]. In this paper, our goal is to develop an AdaBoost-based algorithm that learns faster computationally and also yields interpretable solutions.

We propose to achieve this by adaptively sampling tiny subsets of both instances and features simultaneously, something we refer to as minipatch learning (Fig. 1). Subsampling is widely used in machine learning and has been shown to have both computational and predictive advantages. For instance, bagging [breiman1996bagging] uses the bootstrap technique [Efron1993AnIT] to reduce the dependence of weak learners on the training data. Random forest [breiman2001random], as a specific type of bagging, also reduces the number of effective features in each weak learner. The computational advantages of an ensemble of uniformly random minipatches have been investigated in [louppe2012ensembles].

There is preliminary evidence that uniform minipatch selection yields implicit regularization [srivastava2014dropout, lejeune2019implicit]; however, dropout is not precisely what we refer to as minipatch learning. Rather, the combination of minibatch selection (i.e., stochastic optimization) and dropout in the first layer falls into the category of minipatch learning algorithms. Minipatch sampling can not only accelerate ensemble algorithms, but also implicitly regularize them, i.e., reduce the prediction's sensitivity to individual instances or features.

Subsampling can speed up iterative algorithms; however, uniform sampling can be improved with an adaptive procedure, since the importance and difficulty of the observations and features vary. For example, in classification problems, observations closer to the decision boundary play a critical role in the ultimate model [cortes1995support]. Moreover, we can use adaptive sampling to interpret the final results. For instance, the SIRUS algorithm [benard2019sirus, benard2020interpretable] leverages the frequency of splits in random forest trees to generate an interpretable model with fewer splits. Our MP-Boost algorithm incorporates the advantages of adaptivity in order to learn distributions on the observations and features.

More closely related to our work, several methods employ sampling schemes on top of AdaBoost. Work along this line usually targets adaptive subsampling of either the features or the observations. Exploiting bandit algorithms, adaptive feature selection methods [busa2009bandit, busa2010fast] have been suggested to reduce the number of effective features accessed by each weak learner. Similarly, Tasting [dubout2011tasting] and Laminating [dubout2011boosting, dubout2014adaptive] propose score-based feature selection algorithms for each iteration of AdaBoost. They subsample features adaptively in each iteration but either use all instances to train weak learners or subsample them uniformly.

On the other hand, algorithms like MadaBoost [domingo2000madaboost] and FilterBoost [bradley2008filterboost] perform adaptive oracle subsampling of the observations. However, they do not subsample features, and the number of selected samples increases per iteration, so these methods slow down as they progress. In contrast, the algorithm that we propose exploits minipatch selection and thus takes the importance of both the observations and the features into account.

A series of algorithms have been proposed to reduce the computational complexity of gradient boosting. For example, stochastic gradient boosting (SGB) [friedman2002stochastic] subsamples the observations randomly. XGBoost [chen2016xgboost] extensively optimizes the computational complexity of gradient boosting by introducing a novel decision-tree structure, a fast training algorithm, and several other modifications such as feature subsampling via pre-sorting and histogram-based algorithms.

Fig. 1: Minipatch learning: selecting random minipatches of the dataset during the training of an ensemble algorithm.

In this regard, LightGBM [ke2017lightgbm] and CatBoost [dorogush2018catboost] subsample observations adaptively in proportion to their gradient values and use an algorithm called exclusive feature bundling (EFB) to reduce the number of effective features by categorizing them. In contrast, MP-Boost is designed to gradually learn a probability distribution on the features during its progress instead of categorizing them up front. The minimal variance sampling (MVS) algorithm [ibragimov2019minimal] selects observations according to their gradient values with provably minimal variance; however, it does not subsample the features. Note that MP-Boost is designed based on AdaBoost; thus, its adaptive observation selection is entirely different from that of algorithms designed based on gradient boosting.

The remainder of the paper is organized as follows. In Section 2, we state the problem formulation and present MP-Boost. In Section 3, we investigate the efficacy of the adaptive subsampling and the interpretability of the proposed algorithm, and compare the generalization accuracy of MP-Boost with that of AdaBoost, gradient boosting, and random forest. We end with discussion and concluding remarks in Section 4.

2 MP-Boost

Our goal is to develop a boosting algorithm utilizing adaptive sampling of features and observations that enhances both scalability and interpretability.

2.1 MP-Boost Algorithm

We begin by focusing on binary classification tasks. Let our data be $\mathbf{X} \in \mathbb{R}^{N \times M}$ for $N$ observations or instances and $M$ features; for each instance $\mathbf{x}_i$, we observe a label $y_i$, with $y_i \in \{-1, +1\}$. We seek to learn a classifier $f: \mathbb{R}^{M} \to \{-1, +1\}$.

To achieve this, we propose an adaptive-sampling-based version of boosting inspired by AdaBoost. Our method relies on learning a weak learner from a tiny subset of observations and features at each iteration. We call this tiny subset a minipatch, named after the use of "patches" in image processing and minibatches, the small subsamples of observations commonly used in machine learning. Our approach is to take an ensemble of minipatches, or minipatch learning as shown in Figure 1, where each minipatch is sampled adaptively. Formally, we define a minipatch by $(\mathbf{X}_{I,F}, \mathbf{y}_{I})$, where $I \subset \{1, \dots, N\}$ is a subset of observations of size $n$, and $F \subset \{1, \dots, M\}$ is a subset of features of size $m$. By learning weak learners on tiny subsets or minipatches, our boosting algorithm has major computational advantages for large-$N$ and/or large-$M$ datasets.

We define $\mathcal{H}$ to be the class of weak learners [mohri2018foundations], where each $h \in \mathcal{H}$ is a function $h: \mathbb{R}^{m} \to \{-1, +1\}$. Our algorithm is generic to the type of weak learner, which can be either simple or expressive. However, we select decision trees [breiman1984classification] as the default weak learner in MP-Boost. We consider both fixed-depth trees and saturated trees that are split until each terminal leaf consists of samples from the same class.

The core of our algorithm uses adaptive sampling of observations to achieve the adaptive slow-learning properties [wyner2017explaining] of the AdaBoost algorithm. Similar to [ibragimov2019minimal] for gradient boosting, MP-Boost subsamples observations according to an adaptive probability distribution. Let $\mathbf{p}$ be the probability distribution on observations (i.e., $\sum_{i=1}^{N} p_i = 1$), initially set to be uniform ($p_i = 1/N$). We define $I \sim \mathbf{p}$ as sampling a subset $I$ of $\{1, \dots, N\}$ of size $n$ according to the probability distribution $\mathbf{p}$ without replacement.
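As a concrete illustration, the following is a minimal NumPy sketch of this adaptive sampling step; the array names (X, y, p, q) and the function signature are our own, not the paper's implementation.

import numpy as np

def sample_minipatch(X, y, p, q, n, m, rng):
    """Draw one minipatch: n observations by m features, sampled
    without replacement according to probabilities p and q."""
    N, M = X.shape
    I = rng.choice(N, size=n, replace=False, p=p)  # observation indices
    F = rng.choice(M, size=m, replace=False, p=q)  # feature indices
    return X[np.ix_(I, F)], y[I], I, F

# Example usage with uniform initial distributions:
# rng = np.random.default_rng(0)
# p, q = np.full(N, 1 / N), np.full(M, 1 / M)
# X_mp, y_mp, I, F = sample_minipatch(X, y, p, q, n=100, m=50, rng=rng)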

Let $H$ be the ensemble function. Our algorithm selects a minipatch, trains a weak learner on it, and computes the sum of the weak learners, $H^{(t)} = \sum_{k \le t} h^{(k)}$. Misclassified samples are more difficult to learn, so we need to increase their probabilities so that they are sampled more frequently. Let $w$ be a function that measures the similarity between the ensemble outputs and the labels, i.e., a large positive margin $y_i H(\mathbf{x}_i)$ yields a smaller $w$ and vice versa. MP-Boost assigns a probability proportional to $w(y_i H(\mathbf{x}_i))$ to the $i$-th observation. Table I shows choices for the function $w$ inspired by the weighting functions in AdaBoost and LogitBoost [friedman2000additive]. Note that "Soft" functions are sensitive to the value of the ensemble output, while "Hard" functions only depend on its sign.

Full-batch boosting algorithms reweight all of the observations and train a new weak learner on their weighted average in each iteration [freund1997decision, friedman2000additive, friedman2001greedy]. In contrast, stochastic algorithms use each sample's sampling frequency to take the effect of its weight into account [domingo2000madaboost, bradley2008filterboost, ibragimov2019minimal]. We update the probabilities of the observations similarly to FilterBoost [bradley2008filterboost]. However, FilterBoost increases its sample size during its progress and uses negative sampling, and is thus slower than our approach.

TABLE I: Choices of the weighting function $w$ for weighting samples. Each $w$ is a decreasing function of the margin $y H(\mathbf{x})$ and comes in Exponential and Logistic forms, each with a Soft and a Hard variant. Hard functions depend only on the sign of $H(\mathbf{x})$, which makes MP-Boost less sensitive to outlier samples.
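To make the re-weighting concrete, below is a minimal sketch of one soft-logistic-style update of the observation distribution; the exact functional forms used by MP-Boost are those of Table I, so treat this particular formula and the normalization step as illustrative assumptions.

import numpy as np

def update_observation_probs(p, margins, I):
    """Re-weight the observations in the minipatch I; `margins` holds
    y_i * H(x_i) for i in I, and smaller margins get larger weights."""
    w = 1.0 / (1.0 + np.exp(margins))  # soft-logistic-style, decreasing in the margin
    p = p.copy()
    p[I] = w                           # probability proportional to w(y_i H(x_i))
    return p / p.sum()                 # renormalize to a distribution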

Another major goal is to increase the interpretability of boosting approaches. To accomplish this, we also propose to adaptively select the features that are effective for learning. Similar to $\mathbf{p}$, let $\mathbf{q}$ be the probability distribution on features. Our algorithm requires a criterion to compute the importance of the selected features based on the structure of the weak learner. There exist several choices for computing feature importance. Some of these inspection techniques are model-agnostic, and hence suitable for MP-Boost to incorporate weak learners from different classes. For example, the permutation importance method [breiman2001random, altmann2010permutation] shuffles each feature and infers its importance from the resulting change in the prediction score.

Nevertheless, specific metrics such as the impurity reduction score [nembrini2018revival] are defined for decision trees. We utilize this quantity to define our probability distribution over the features. Let $\boldsymbol{\phi}^{(t)}$ denote the normalized feature importance vector for the weak learner $h^{(t)}$, wherein each entry determines the relative importance of the corresponding feature compared to the other features in the minipatch. In each iteration, $\mathbf{q}$ is updated by computing the weighted average of $\mathbf{q}$ and $\boldsymbol{\phi}^{(t)}$ according to a momentum parameter $\rho$. The hyperparameter $\rho$ determines the ratio of exploration vs. exploitation. In each iteration, MP-Boost only modifies the probabilities of the features inside the minipatch.
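Below is a minimal sketch of this momentum-style update using a scikit-learn tree's impurity-based importances; the variable names and the exact placement of rho in the weighted average are our assumptions.

import numpy as np

def update_feature_probs(q, F, tree, rho):
    """Blend the current feature distribution q with the normalized
    importances of the minipatch features F from the newly trained tree."""
    phi = tree.feature_importances_            # importances over the m minipatch features
    if phi.sum() > 0:
        phi = phi / phi.sum()
        q = q.copy()
        q[F] = rho * q[F] + (1.0 - rho) * phi  # momentum-weighted average
    return q / q.sum()                         # renormalize over all M features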

MP-Boost
Initialization ($t = 0$):
  $p_i \leftarrow 1/N$ for all $i$    // observation probabilities
  $q_j \leftarrow 1/M$ for all $j$    // feature probabilities
  $H^{(0)} \leftarrow 0$    // ensemble output
  $H_{\text{oop}}^{(0)} \leftarrow 0$    // out-of-patch output
while the stopping criterion is not met do

  1. Sample a minipatch:

    1. $I^{(t)} \sim \mathbf{p}$    // select $n$ instances

    2. $F^{(t)} \sim \mathbf{q}$    // select $m$ features

    3. $(\mathbf{X}_{I^{(t)}, F^{(t)}}, \mathbf{y}_{I^{(t)}})$    // minipatch

  2. Train a weak learner on the minipatch:

    1. $h^{(t)}$: weak learner trained on $(\mathbf{X}_{I^{(t)}, F^{(t)}}, \mathbf{y}_{I^{(t)}})$

  3. Update outputs: add $h^{(t)}$ to the ensemble output $H^{(t)}$ and, for observations outside $I^{(t)}$, to the out-of-patch output $H_{\text{oop}}^{(t)}$

  4. Update probability distributions:

    1. $p_i \propto w\big(y_i H^{(t)}(\mathbf{x}_i)\big)$, where $w$ is a weighting function from Table I; update $\mathbf{q}$ on the features in $F^{(t)}$ as a weighted average of $\mathbf{q}$ and $\boldsymbol{\phi}^{(t)}$ with momentum $\rho$

  5. Out-of-Patch Accuracy: compute $\text{Acc}_{\text{oop}}^{(t)}$ from $H_{\text{oop}}^{(t)}$ and $\mathbf{y}$

end while
Return $H$

Algorithm 1 MP-Boost

Finally, many boosting algorithms are designed to run for a fixed number of iterations [friedman2001greedy] or use a validation criterion [domingo2000madaboost, bradley2008filterboost, Meijer2016RegularizingAW] to determine when to stop. Internal validation approaches often have better performance and are computationally much faster. For instance, the out-of-bag criterion [breiman1996out] in bagging and random forest provides internal validation without incurring any additional computational cost. Similar to bagging, MP-Boost has access to out-of-patch instances, which we can use for internal validation. Therefore, for each sample, we accumulate the outputs of the weak learners that do not have it in their minipatch. Thus, we define the out-of-patch output as

$H_{\text{oop}}^{(t)}(\mathbf{x}_i) = \sum_{k \le t \,:\, i \notin I^{(k)}} h^{(k)}(\mathbf{x}_i) \qquad (1)$

for an arbitrary training sample $\mathbf{x}_i$. Accordingly, the out-of-patch accuracy, $\text{Acc}_{\text{oop}}^{(t)}$, can easily be quantified as the fraction of training samples for which $\text{sign}\big(H_{\text{oop}}^{(t)}(\mathbf{x}_i)\big) = y_i$.
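The sketch below shows one way to accumulate these out-of-patch outputs alongside training; the running array H_oop and the helper names are our bookkeeping, not necessarily the paper's implementation.

import numpy as np

def update_oop(H_oop, tree, X, F, I):
    """Add the new weak learner's predictions only for observations
    that were *not* in its minipatch I."""
    oop_mask = np.ones(X.shape[0], dtype=bool)
    oop_mask[I] = False
    H_oop[oop_mask] += tree.predict(X[oop_mask][:, F])
    return H_oop

def oop_accuracy(H_oop, y):
    """Out-of-patch accuracy: sign of the accumulated output vs. the labels."""
    return np.mean(np.where(H_oop >= 0, 1, -1) == y)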

Out-of-patch accuracy is a conservative estimate of the test accuracy. Hence, it can assist MP-Boost in tracking the progress of the generalization (test) performance internally and deciding at which iteration to stop. In a nutshell, by observing the $\text{Acc}_{\text{oop}}$ value, the algorithm detects when it has saturated. Algorithm 2 (in Appendix B) is a heuristic that takes $\text{Acc}_{\text{oop}}^{(t)}$, compares it with its previous values, and decides when it has saturated. The stopping rule follows a general principle: if the current value of $\text{Acc}_{\text{oop}}$ increases by some margin, then the algorithm needs more time to improve; otherwise, the generalization performance has saturated.

We put all of this together in a summary of our MP-Boost algorithm in Algorithm 1. Notice that selecting minipatches (step 1) reduces the computational complexity imposed per iteration and thus improves scalability. Other variations of AdaBoost usually subsample either features or observations, whereas ours exploits both. Therefore, in addition to the predictive model $H$, our algorithm learns probability distributions $\mathbf{p}$ and $\mathbf{q}$ that express the importance of the observations and features, respectively. Since learning $\mathbf{p}$ and $\mathbf{q}$ is part of the iterative procedure, it does not incur an extra computational cost. In addition, MP-Boost exploits internal validation, which yields an automatic stopping criterion when the algorithm ceases to learn. Hence, steps (4) and (5) of Algorithm 1 highlight the main differences between MP-Boost and other sampling-based boosting algorithms.

2.2 Hyperparameter Tuning

The minipatch size $(n, m)$ is a crucial hyperparameter of our algorithm. A large $n$ or $m$ slows down the training of the weak learners but is more likely to yield better performance. Note that $m$ must be large enough to provide a wide range of features for comparison and to update the feature probabilities properly. Similarly, $n$ must be large enough that each minipatch represents a meaningful subset of the observations. On the other hand, a small $n$ results in an $\text{Acc}_{\text{oop}}$ value akin to the generalization accuracy. Selecting around ten percent of the observations and features seems to be a proper choice for our algorithm, as evidenced in our empirical studies. Additionally, our studies reveal that the results are fairly robust to small changes in $n$ and $m$. The momentum $\rho$ is the other important hyperparameter in our algorithm, where a moderate value strikes a balance between exploration and exploitation. While our default hyperparameter settings seem to perform well and are robust in a variety of settings (see Section 3), one could always select these in a data-driven manner using our $\text{Acc}_{\text{oop}}$ criterion as well.
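As a rule of thumb following the discussion above, one could start from defaults like the following; the specific numbers (10% minipatch sizes, rho = 0.5) are illustrative choices for a sketch, not values prescribed by the paper.

def default_hyperparameters(N, M):
    """Rough defaults: about 10% of observations and features per
    minipatch, and a moderate momentum for the feature distribution."""
    return {
        "n": max(1, int(0.1 * N)),  # observations per minipatch
        "m": max(1, int(0.1 * M)),  # features per minipatch
        "rho": 0.5,                 # exploration/exploitation trade-off (illustrative)
    }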

2.3 Extensions

Our proposed algorithm is initially developed for the binary classification problem. Here, we discuss its extension to regression and multiclass classification problems. First, for multiclass classification, algorithms like AdaBoost.M2, AdaBoost.MH, and AdaBoost.OC [schapire1997using, chengsheng2017adaboost] have been proposed as multiclass extensions of AdaBoost. We can incorporate similar modifications to extend MP-Boost as well. Further, [drucker1997improving, solomatine2004adaboost] have suggested regression extensions, such as AdaBoost.R2 and AdaBoost.RT, to vanilla AdaBoost. For regression, we would have to change our observation probability function and the out-of-patch accuracy, and we can employ the techniques in [breiman1996out] to address the regression problem. All of these are possible further extensions to our approach.

3 Experiments

We begin by using an illustrative case study to show how our method works and how it aids interpretability. Next, we compare our algorithm to other popular boosting and tree-based methods, focusing on accuracy and scalability.

   

Fig. 2: (a) Effect of MP-Boost adaptive sampling on the MNIST (3 vs. 8) test data. (b) Train, test, and out-of-patch accuracy on MNIST (3 vs. 8) using MP-Boost. The dashed line shows the stopping time based on Algorithm 2.

3.1 Illustrative Case Study

We use a series of experiments to demonstrate how our algorithm works and show how to interpret the results. Specifically, we ask: how does adaptive sampling boost performance? How does the out-of-patch accuracy relate to the test accuracy and yield a data-driven stopping criterion? And how do we interpret the results via the final values of $\mathbf{p}$ and $\mathbf{q}$?

To answer these questions, we focus our investigations on an interpretable binary classification task: detecting digit 3 versus 8 in MNIST [lecun1998mnist]. This dataset includes handwritten digits as images of size $28 \times 28$. The training data is large ($N = 12{,}000$) and high-dimensional ($M = 784$). We use cross-validation to tune all hyperparameters ($n$, $m$, and $\rho$), and select the Soft-Logistic function in Table I as $w$.

(a) Before training. (b) After training.
Fig. 3: Probability distribution of MNIST (digit 3 vs. 8) samples before and after training MP-Boost. Samples are projected onto a two-dimensional space using PCA. The size of each point (sample) indicates its relative probability. After training, samples close to the decision boundary have higher probabilities. The displayed samples are visually hard to learn and have high probabilities.

To measure the effect of adaptive observation and/or feature selection, we train MP-Boost on MNIST (3 vs. 8). Then, we turn off the adaptive updates for $\mathbf{p}$ and replace it with the uniform distribution, resulting in random observation sampling. We do the same for $\mathbf{q}$, analogously, for the features. Finally, we turn off adaptive sampling for both features and observations, and repeat each experiment multiple times. Figure 2(a) shows the superiority of joint adaptive sampling of both observations and features in terms of performance on the test data. Further, we compare the training, out-of-patch, and test accuracy curves for MP-Boost in Figure 2(b). The dashed line indicates the stopping time of our algorithm based on the $\text{Acc}_{\text{oop}}$ curve and our stopping heuristic, Algorithm 2. To observe the behavior of the three curves, we let MP-Boost continue after the stopping criterion is satisfied. As shown in Figure 2(b), and unlike the training curve, the trend in the out-of-patch curve is similar to that of the test curve. These results demonstrate the power that adaptive sampling of both observations and features brings to boosting performance, as well as the ability to use the $\text{Acc}_{\text{oop}}$ curve to assess the algorithm's progress.

Next, we illustrate how to use $\mathbf{p}$ and $\mathbf{q}$ to interpret the observations and features. First, observations that are difficult to classify are upweighted in $\mathbf{p}$. Hence, we can use $\mathbf{p}$ to identify the most challenging samples, yielding a type of interpretation similar to that commonly employed in support vector machines [cortes1995support]. To visualize the final value of $\mathbf{p}$, we project the observations onto a two-dimensional space using PCA, with the size of each observation proportional to $p_i$. We show this visualization before and after training in Figure 3. Further, we highlight a few of the observations with large values of $p_i$, illustrating that these samples are indeed difficult to distinguish between the two classes.
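A short sketch of this visualization with scikit-learn and matplotlib; the marker-size scaling factor is arbitrary.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_observation_probs(X, y, p):
    """Project observations to 2D with PCA and size each point by its
    learned sampling probability p_i."""
    Z = PCA(n_components=2).fit_transform(X)
    plt.scatter(Z[:, 0], Z[:, 1], s=5000 * p, c=y, cmap="coolwarm", alpha=0.6)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.show()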

Fig. 4: Probability distribution of MNIST (digit 3 vs. 8) features during the training of MP-Boost. Panels (a)-(f) show snapshots over the iterations, and the last panel shows the final $\mathbf{q}$. The color of each pixel (feature) indicates the relative probability with which the feature is selected. As MP-Boost progresses, the probability of ineffective features (pixels in the background) decreases while that of effective features (pixels shaping the digits) increases. The final distribution resembles a blurred version of digits 3 and 8.

Finally, we show how to use $\mathbf{q}$ to find the most important features, and also illustrate how MP-Boost learns these feature probabilities in Figure 4. Here, the color of each pixel (feature) is proportional to its probability, with darker pixels indicating that the feature has been upweighted. We expect a sparse representation for $\mathbf{q}$, which matches our result. Moreover, this example clearly shows how to interpret the most relevant features; in panel (f) of Figure 4, two regions are darker than the other pixels, corresponding to the complementary area between digits 3 and 8.
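For image data such as MNIST, the learned feature distribution can be rendered directly as an image; a minimal sketch, assuming q is a length-784 array of pixel probabilities:

import matplotlib.pyplot as plt

def plot_feature_probs(q, shape=(28, 28)):
    """Display the learned feature distribution as a heat map, so that
    darker pixels correspond to more frequently selected features."""
    plt.imshow(q.reshape(shape), cmap="Greys")
    plt.colorbar(label="feature probability")
    plt.axis("off")
    plt.show()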

3.2 Comparative Empirical Results

Here, we compare the speed and performance of our algorithm with AdaBoost, gradient boosting, and random forest on multiple binary classification tasks. To this end, we select large real datasets (with large $N$ and $M$) from the UCI machine learning repository [Dua:2019], MNIST [lecun1998mnist], and CIFAR-10 [krizhevsky2009learning]. Moreover, we use a sparse synthetic dataset of two high-dimensional cones. The sizes of the datasets are provided in Appendix A.

To be fair, we choose the oracle hyperparameters for every method. To this end, we pick decision trees with different maximum depths or depth-saturated trees as weak learners. Note that we use scikit-learn [scikit-learn] modules to implement our algorithm and all competitors so that all time-based comparisons are fair. For our method, we select the minipatch sizes $n$ and $m$ and the momentum $\rho$ from a small grid of candidate values.
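For reference, a sketch of how the scikit-learn baselines could be configured with tree weak learners; the parameter values shown are placeholders, not the tuned oracle values used in the experiments.

from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.tree import DecisionTreeClassifier

# Placeholder configuration; the paper tunes the tree depth (or uses
# saturated trees) and the number of estimators per dataset.
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=200)
gb = GradientBoostingClassifier(max_depth=3, n_estimators=200)
rf = RandomForestClassifier(max_depth=None, n_estimators=200)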

For each dataset, we compare the best performance of MP-Boost against that of AdaBoost, gradient boosting, and random forest constrained to the runtime of MP-Boost. Table II shows the best performance of each algorithm within this fixed runtime (the MP-Boost training time). The results indicate that MP-Boost achieves better performance faster than the other three algorithms. We also provide a more comprehensive comparison in Table III, which shows that, without any runtime constraint, MP-Boost is much faster with comparable accuracy across a wide variety of datasets.

Dataset MP-Boost AdaBoost Random Forest Gradient Boosting
Cones
Hill-Valley
Christine
Jasmine
Philippine
SensIT Vehicle
Higgs Boson
MNIST (3,8)
MNIST (O,E)
CIFAR-10 (T,C)
GAS Drift
DNA
Volkert
TABLE II: Best test accuracy (%) of MP-Boost, AdaBoost, Random Forest, and Gradient Boosting on binary classification tasks during the runtime of MP-Boost.

4 Discussion

In this work, we proposed a new boosting approach using adaptive minipatch learning. We showed that our approach is computationally faster, as well as more interpretable, compared to standard boosting algorithms. Moreover, we showed that our algorithm outperforms AdaBoost and gradient boosting within a fixed runtime and has comparable performance without any time constraint.

Our approach would be particularly useful for large data settings where both the number of observations and the number of features are large. Further, our MP-Boost algorithm is particularly well suited to datasets with sparse or noisy features because we learn the important features. Our adaptive observation selection could also be used in active learning problems.

We suggested extensions of MP-Boost for multiclass classification and regression problems in this paper. In future work, we plan to compare them empirically to existing algorithms. Another direction for future work is to use this minipatch learning scheme not only in AdaBoost-like methods but also in gradient-boosting-based methods. Additionally, one could explore other updating schemes for both the observation and feature probabilities. Theoretical analysis of MP-Boost would help to explain the performance of our algorithm and its properties. More efficient implementations could further improve the speed of MP-Boost for different schemes. Likewise, efficient memory allocation is another important direction, since reducing the required hardware would enable industry-scale problems to exploit the properties of MP-Boost.

Dataset Criteria MP-Boost AdaBoost Random Forest Gradient Boosting
Cones Test
Time
Hill-Valley Test
Time
Christine Test
Time
Jasmine Test
Time
Philippine Test
Time
SensIT Vehicle Test
Time
Higgs Boson Test
Time
MNIST (3,8) Test
Time
MNIST (O,E) Test
Time
CIFAR-10 (T,C) Test
Time
GAS Drift Test
Time
DNA Test
Time
Volkert Test
Time
TABLE III: Best test accuracy (%) and its corresponding training time (sec) of MP-Boost, AdaBoost, Random Forest, and Gradient Boosting on binary classification tasks.

References

Appendix A Dataset

We select the datasets listed in Table IV from the UCI machine learning repository [Dua:2019], MNIST [lecun1998mnist], and CIFAR-10 [krizhevsky2009learning]. Note that for multiclass datasets like MNIST and CIFAR-10, we select two classes that are harder to distinguish, such as digits 3 and 8 from MNIST or the "Truck & Car" classes from CIFAR-10. Additionally, for some multiclass datasets we partition all classes into two groups, such as the "Odd & Even" digits in MNIST. About 20% of each dataset is selected as the test data. "Cones" is a sparse synthetic dataset of two originally low-dimensional cones; by adding noise features, we transform the data points into a higher-dimensional space.

Dataset    Training size    # Features    Test size
Cones 20000 500 5000
Hill-Valley 1000 100 200
Christine 4300 1636 1100
Jasmine 2400 144 600
Philippine 4700 308 1100
SensIT Vehicle 78800 100 19700
Higgs Boson 200000 30 50000
MNIST (3 vs. 8) 12000 784 2000
MNIST (Odd vs. Even) 60000 784 10000
CIFAR-10 (Truck vs. Car) 10000 3072 2000
Gas Drift 11100 128 2800
DNA 2600 180 6000
Volkert 46600 180 11700
Fabert 6600 800 1600
TABLE IV: Size of the datasets in Tables II and III

Appendix B Stopping Criterion Algorithm

The trend in $\text{Acc}_{\text{oop}}$ approximates the generalization accuracy; however, it exhibits numerical oscillations. To make the stopping algorithm (Algorithm 2) robust to such oscillations, instead of tracking only the largest previous value of $\text{Acc}_{\text{oop}}$, it keeps the highest previous values of $\text{Acc}_{\text{oop}}$ in a list and compares the current value with their minimum. If the value does not improve over a fixed number of successive iterations, then MP-Boost halts.

growth-ratio threshold    // minimum ratio of growth in $\text{Acc}_{\text{oop}}$ to count as progress
number of top values    // size of the list of the largest $\text{Acc}_{\text{oop}}$ values kept
patience    // number of successive iterations allowed without progress
maximum number of iterations
a list containing zeros    // keeps the top largest values of $\text{Acc}_{\text{oop}}$ up to each iteration
Stopping-Criterion($\text{Acc}_{\text{oop}}^{(t)}$)

  1. Detect the best iteration:

    1. if $\text{Acc}_{\text{oop}}^{(t)}$ exceeds the best value so far, record $t$ as the best iteration

  2. Check the stopping criteria:

    1. if the patience is exhausted or the maximum number of iterations is reached, halt Algorithm 1

  3. Check the progress of $\text{Acc}_{\text{oop}}$: compare $\text{Acc}_{\text{oop}}^{(t)}$, up to the growth-ratio threshold, with the minimum of the stored top values

  4. Update the list of top values:

    1. if $\text{Acc}_{\text{oop}}^{(t)}$ exceeds the minimum entry, replace the minimum entry with $\text{Acc}_{\text{oop}}^{(t)}$

Algorithm 2 Stopping-Criterion
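A minimal, hedged sketch of such a patience-based stopping rule in Python; the class name, the fixed-size list of best values, and the default thresholds are our illustrative choices, not the paper's exact settings.

import numpy as np

class OOPStoppingCriterion:
    """Halt when the out-of-patch accuracy stops improving (by a small
    relative margin) over the best recent values for a number of
    successive iterations, or when a maximum iteration count is reached."""
    def __init__(self, n_top=5, patience=50, min_growth=1e-3, max_iter=10000):
        self.best = np.zeros(n_top)   # top out-of-patch accuracies seen so far
        self.patience = patience
        self.min_growth = min_growth
        self.max_iter = max_iter
        self.stall = 0
        self.t = 0

    def should_stop(self, acc_oop):
        self.t += 1
        if acc_oop > self.best.min() * (1 + self.min_growth):
            self.stall = 0            # meaningful progress was made
        else:
            self.stall += 1           # no meaningful progress this iteration
        if acc_oop > self.best.min():
            self.best[self.best.argmin()] = acc_oop
        return self.stall >= self.patience or self.t >= self.max_iter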