Efficient Identification of Approximate Best Configuration of Training in Large Datasets

11/08/2018 · Silu Huang et al. · Microsoft and University of Illinois at Urbana-Champaign

A configuration of training refers to a combination of feature engineering, learner, and the learner's associated hyperparameters. Given a set of configurations and a large dataset randomly split into a training and a testing set, we study how to efficiently identify a configuration with approximately the highest testing accuracy when trained on the training set. To guarantee small accuracy loss, we develop a solution using a confidence interval (CI)-based progressive sampling and pruning strategy. Compared to using the full data to find the exact best configuration, our solution achieves more than two orders of magnitude speedup, while the returned top configuration has identical or close test accuracy.

1 Introduction

Increasing the productivity of data scientists has been a target for many machine learning service providers, such as Azure ML, DataRobot, Google Cloud ML, and AWS ML. For a new predictive task, a data scientist usually spends a vast amount of time training a good ML solution. A proper configuration, i.e., the combination of preprocessing, feature engineering, learner (i.e., training algorithm) and the associated hyperparameters, is critical to achieving good performance. It usually takes tens or hundreds of trials to select a suitable configuration.

AutoML tools like auto-sklearn [Feurer et al.2015] automate these trials and output the configuration with the highest evaluated performance. However, both the manual and the AutoML approaches become increasingly inefficient as the available ML data volume grows to millions of records or more: even the trial of a single configuration can take hours or days at such a scale. Motivated by this efficiency issue, we propose a module called approximate best configuration (Abc). Given a set of configurations, it outputs an approximate best configuration, such that the accuracy loss relative to the best configuration is below a threshold. Our goal is to identify this approximate best configuration efficiently.

The intuition behind Abc is that an ML model trained over a sampled dataset can be used to approximate the model trained over the full dataset. However, the optimal sample size for determining the best configuration up to an accuracy loss threshold is unknown. We develop a novel confidence interval (CI)-based progressive sampling and pruning solution by addressing two questions: (a) CI estimator: given a sampled training dataset, how to estimate a confidence interval for a configuration's real performance with full training data? (b) Scheduler: as the optimal sample size is unknown a priori, how to allocate an appropriate sample size to each configuration?

Contributions. The contributions of this paper are as follows:

  • We develop an Abc framework using progressive sampling and CI-based pruning strategy. It ensures finding an approximate best configuration while reducing the run time.

  • We present and prove bounds for the real test accuracy when the ML model is trained using full data, based on the model trained with sampled data.

  • Within Abc, we design a scheduling scheme based on the confidence interval, for allocating sample size among different configurations. The schedule is approximately optimal.

  • We conduct extensive experiments with different datasets and varying configuration sets. We demonstrate that our Abc solution is tens to hundreds of times faster, while returning top configurations with no more than 1% accuracy loss.

2 Problem Formulation

Notions and Notations. In this paper, we focus on classification tasks with a large set of labeled data D. For reliable evaluation of a trained classifier, data scientists usually split the available data D randomly into a training set D_train and a testing set D_test. After that, they specify a number of configurations of the ML workflow and try to identify the best configuration. Let C be the candidate configuration set and c_i be the i-th configuration in C. We further let n be the number of configurations, i.e., n = |C|. Using terminology from learning theory, each configuration c_i defines a hypothesis space H_i, where each hypothesis h in H_i is a possible classifier trained under this configuration. Given a training dataset, the learner in c_i outputs a hypothesis h in H_i as the trained classifier. The quality of the classifier is measured against the held-out testing data D_test. In this paper, we focus on accuracy as the quality metric. We denote the accuracy of hypothesis h on dataset D as acc(h, D). In particular, given a configuration c_i, we define its real test accuracy v_i as the accuracy on D_test of the classifier trained with c_i on the full training data D_train.

Problem Definition. A standard practice to select the best configuration from a configuration set is to train with each configuration using the full training data, and then pick the one with the highest test accuracy, i.e., the configuration c* maximizing v_i. Note that an implicit assumption made here is that the classifier trained with the full training data has equal or higher test accuracy than the classifier trained with sampled training data. We follow that assumption in this paper. From a user's perspective, if there are multiple configurations with nearly identical highest real test accuracy, it suffices to return any of them as the best configuration. So we introduce a new problem, approximate best configuration identification, as formalized in Problem 1.

Problem 1 (Approximate Best Configuration Identification).

Given a configuration candidate set C and an accuracy loss tolerance ε, identify a configuration c whose real test accuracy is within ε of that of the best configuration c*, i.e., v_{c*} − v_c ≤ ε, and minimize the total run time.

3 CI-based Framework

Before introducing our framework, we first describe some insights based on simple observations. We experiment on the FlightDelay dataset published in the Azure Machine Learning gallery [Mund2015] with five learners (as five configurations). Readers can refer to Table 1 for detailed statistics of this dataset, where N and d denote the number of records and features, respectively. The learning curve for each configuration is depicted in Figure 1, where the x-axis is the training sample size in log scale and the y-axis is the test accuracy on D_test. In general, the test accuracy approaches the real test accuracy as the training sample size increases. When the sample size is large enough (> 2M), the configuration with the highest test accuracy is LightGBM, the true best configuration.

Dataset | Records (N) | Features (d) | Origin
TwitterSentiment | 1.4M | 9866 | Twitter, Stanford
FlightDelay | 7.3M | 630 | U.S. Department of Transportation
NYCTaxi | 10M | 21 | NYC Taxi & Limousine Commission
HEPMASS | 10M | 28 | UCI
HIGGS | 10.6M | 28 | UCI
Table 1: Dataset Description

Furthermore, the optimal sample size that minimizes the run time can vary across configurations. If we magically knew that we should use 2M training samples for LightGBM and 16K training samples for all the other configurations, we could save even more time and still identify the correct best configuration. Unfortunately, the optimal sample size for each configuration is unknown. A natural idea is to increase the sample size gradually until a plateau is reached in the learning curve. However, a naive plateau estimator based on the learning curve is error-prone. As shown in Figure 1, LightGBM's learning curve is flat from 32K to 128K; if we stopped increasing its sample size there, it would be mis-pruned. Therefore, a more robust strategy is needed.

Figure 1: Learning Curve

High-level Idea. The main idea is to estimate the confidence interval (CI) of each configuration’s real test accuracy with sampled data. In each round, we train the classifier for a selected configuration on some sampled data. We call such training a probe. After a probe, we update the confidence interval for the configuration. As the sample size increases, the confidence interval shrinks, and the badly-performing configurations can be pruned based on the CIs. Such a CI-based framework is more robust than using point estimates like the learning curve.

1  Input: configuration set C, accuracy loss threshold ε;
2  Output: the approximate best configuration;
3  Initialization: C_remain ← C; m_i ← initial sample size, l_i ← 0, u_i ← 1 for every c_i in C; c ← c_1;   // C_remain is the remaining configuration set
4  while |C_remain| > 1 do
5      h ← Probe(c, m_c);   // train c on m_c sampled training records
6      (l_c, u_c) ← CIEstimator(c, h);
7      c_t ← the configuration in C_remain with the largest lower bound l_t;
8      for each c_j in C_remain with c_j ≠ c_t do   // pruning
9          if u_j ≤ l_t + ε then remove c_j from C_remain;
10     (c, m_c) ← Scheduler(C_remain);   // next configuration to probe and its new sample size
11 return c_t;
Algorithm 1 Abc

Detailed Algorithm. Specifically, Abc proceeds round by round as shown in Algorithm 1, where each configuration c_i is annotated with its current sample size (m_i), lower bound (l_i), and upper bound (u_i). In each round of the while loop (line 4), it first probes the chosen configuration (line 5). It then calls a CIEstimator subroutine to quickly estimate the confidence interval for that configuration (line 6). Next, it prunes badly-performing configurations (lines 7-9): line 7 identifies the configuration with the largest lower bound, and lines 8-9 prune a configuration if its upper bound is within ε of that largest lower bound. At last, it calls a Scheduler subroutine to determine which configuration to probe next as well as its sample size (line 10).
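To make the control flow concrete, the following Python sketch mirrors Algorithm 1. The probe, ci_estimator, and scheduler callables are hypothetical stand-ins for the subroutines developed in the next two sections; only the loop structure and the pruning rule of lines 7-9 are taken from the pseudocode above.

from typing import Callable, Dict, List, Tuple

def abc(
    configs: List[str],
    epsilon: float,
    probe: Callable[[str, int], object],                          # trains a config on a sample of given size
    ci_estimator: Callable[[str, object], Tuple[float, float]],   # returns (lower, upper) bound
    scheduler: Callable[[Dict[str, Tuple[float, float]]], Tuple[str, int]],
    init_size: int = 1000,
) -> str:
    """Illustrative sketch of Algorithm 1: progressive sampling with CI-based pruning."""
    remaining = set(configs)
    bounds = {c: (0.0, 1.0) for c in configs}          # (lower, upper) CI per configuration
    next_probe, sample_size = configs[0], init_size

    while len(remaining) > 1:
        model = probe(next_probe, sample_size)                 # train on sampled data
        bounds[next_probe] = ci_estimator(next_probe, model)   # update the CI of the probed config

        # Identify the configuration with the largest lower bound.
        best = max(remaining, key=lambda c: bounds[c][0])
        best_lower = bounds[best][0]

        # Prune any other configuration whose upper bound is within epsilon of that lower bound.
        remaining = {c for c in remaining
                     if c == best or bounds[c][1] > best_lower + epsilon}

        # Ask the scheduler which configuration to probe next and with how many samples.
        if len(remaining) > 1:
            next_probe, sample_size = scheduler({c: bounds[c] for c in remaining})

    return remaining.pop()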

We describe CIEstimator and Scheduler in the next two sections. All omitted proofs are presented in the Appendix.

4 CI Estimator

In this section, we derive a CIEstimator for each configuration's real test accuracy, based on the probe over sampled data. For a configuration c_i, the confidence interval [l_i, u_i] needs to contain the real test accuracy v_i with high probability. The computation of l_i and u_i needs to be efficient, i.e., no slower than the probe itself. In the following, we assume the configuration is fixed and omit its index in the notation.

At first glance, the CI estimation may remind readers of generalization error bounds (e.g., the VC-bound). A generalization error bound is a universal bound on the difference between each hypothesis's accuracy on the training data and its accuracy on infinite data drawn from the same distribution. Nevertheless, the confidence interval we need is a range for the real test accuracy of the hypothesis trained from the full training data, while we only have a hypothesis trained from a sample of it. Therefore, we cannot directly apply generalization error bounds to obtain our confidence interval.

Upper bound. The intuition behind the confidence interval estimation is that we need to relate the two hypotheses (the one trained on the full training data and the one trained on the sample) and use the information we have about the latter to infer the performance of the former. To upper bound the accuracy of the full-data hypothesis, we leverage a fitness condition: the training process produces a hypothesis that fits the training data. When the configuration is fixed, the accuracy on a dataset of the hypothesis trained on that dataset should be no worse than that of a hypothesis trained on a different dataset. It is the only assumption we need to prove the upper bound, no matter what training algorithm is used. Under this condition, we derive an inequality chain that connects the training accuracy on the sample to the real test accuracy of the full-data hypothesis.

Theorem 1 (Upper Bound).

Under the fitness condition, with probability at least 1 − δ, the real test accuracy v satisfies v ≤ u, where u consists of the training accuracy on the sampled training data plus two variation terms, detailed after the proof.

Figure 2: Notations Used in CI Estimation and Analysis
Proof.

We use Figure 2 to summarize the notations and their relationships, which are important for understanding the theoretical results. We write h_full, h_samp, and h_all for the hypothesis returned by training the fixed configuration on the full training dataset D_train, on the sampled training dataset S_train, and on the full data D, respectively. Figure 2(a) shows the overall derivation relationships among these quantities. First, the full training data D_train and the full testing data D_test are randomly split from the whole data D. Second, the sampled training data S_train and the sampled testing data S_test are randomly drawn from D_train and D_test, respectively. Last, h_samp is trained from the sampled training data S_train. Note that the CI estimator only has access to S_train and S_test. Though h_full and h_all are not accessible, they are useful in our analysis.

Let us first recall the fitness condition. Given a fixed configuration, let h_A and h_B be the hypotheses returned by training on two different sample sets A and B, respectively. Note that h_A and h_B are both from the same fixed hypothesis space. Our assumption is that h_A has no lower accuracy on A than h_B does, and similarly h_B has no lower accuracy on B than h_A does.

First, let us break down into four clauses, as shown in Equation (1).

(1)

Since and are randomly split from , we have . Let be the hold-out ratio. For any , we have:

(2)

Next, apply Equation (2) to the first clause in Equation (1):

(3)

The inequality is derived from the fitness assumption (recall from Figure 2 that is trained from and is trained from ).

Next, we bound the second clause in Equation (2) with Hoeffding’s inequality: With probability at least ,

(4)

Similarly, with probability at least ,

(5)

Please note that Equations (4) and (5) will not hold if we replace with . This is because for hypothesis , and cannot be regarded as random samples, since is tailored to the sample set . Therefore, introducing is necessary in our analysis.

Last, since is trained from , by the fitness assumption we have:

(6)

By substituting the four clauses in Equation (1) with Equations (3)-(6), we obtain Theorem 1 using the union bound. ∎

Our upper confidence bound u has an additive form with three components: the training accuracy on the sampled training dataset S_train, a variation term due to the training sample size, and a variation term due to the full testing data size. Intuitively, u increases as the training accuracy increases, because higher training accuracy indicates a higher potential of the configuration's learning ability. But that potential decreases as the training sample size increases, because the more data we have used, the less room there is for improvement by adding more training data. Finally, since the real test accuracy is measured on the full testing data D_test, the variation due to the random split needs to be added to u; the larger D_test is, the smaller this variation is. Both variation terms are affected by the confidence probability: a higher confidence probability corresponds to a wider confidence interval and thus a larger u. In sum, u is positively correlated with the training accuracy and the number of configurations n, and negatively correlated with the training sample size and the full testing data size.

Note that the computation of u is no slower than the probing (i.e., training with sampled data). In fact, testing is usually much more efficient than training at the same data scale.
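As a rough illustration of the additive shape of u (training accuracy plus two variation terms), a CI estimator could be sketched as below. The constants and the hold-out-ratio adjustment from Theorem 1's proof are deliberately simplified here, so this is not the paper's exact bound, just generic Hoeffding-style terms.

import math

def upper_bound(train_acc_on_sample: float,
                train_sample_size: int,
                full_test_size: int,
                delta: float,
                n_configs: int) -> float:
    """Illustrative upper confidence bound: training accuracy on the sampled
    training data plus two Hoeffding-style variation terms, one driven by the
    training sample size and one by the full testing data size. Generic
    constants only; Theorem 1 gives the exact terms."""
    # Split the failure probability across configurations and the two terms.
    d = delta / (2.0 * n_configs)
    train_var = math.sqrt(math.log(1.0 / d) / (2.0 * train_sample_size))
    test_var = math.sqrt(math.log(1.0 / d) / (2.0 * full_test_size))
    return min(1.0, train_acc_on_sample + train_var + test_var)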

Lower Bound. The lower bound is easier to obtain, due to the presumption discussed in the problem formulation: full training data produces a better hypothesis than sampled training data for a fixed configuration. The real test accuracy of the full-data classifier can then be lower bounded by the test accuracy of the sampled-data classifier. However, computing that accuracy on the full testing data D_test can be slower than probing if D_test is large. To make the CI estimation efficient, we also sample the testing data; we denote the sampled testing data as S_test. We can then lower bound the real test accuracy by the accuracy of the sampled-data classifier on S_test minus a variation term (using standard concentration bounds).

Theorem 2 (Lower Bound).

With probability at least 1 − δ, the real test accuracy v is at least the lower bound l, i.e., the accuracy of the sampled-data classifier on the sampled testing data minus a variation term due to the testing sample size.

Proof.

First, in the problem formulation we have assumed

(7)

Next, based on Hoeffding's inequality, with probability at least ,

(8)

Combining Equations (7) and (8), we obtain the claimed lower bound with the stated probability. ∎

Our lower confidence bound l is expressed as the accuracy of the sampled-data classifier on the sampled testing dataset S_test, minus a variation term due to the testing sample size. As the testing sample size increases, the difference between the accuracy on S_test and the accuracy on D_test becomes smaller, and the lower bound rises. A higher confidence probability corresponds to a smaller l. In sum, l is positively correlated with the testing sample size and the testing accuracy on the sample, and negatively correlated with the confidence probability.
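A matching sketch for the lower bound, again with generic Hoeffding constants rather than the exact term from Theorem 2:

import math

def lower_bound(test_acc_on_sample: float,
                test_sample_size: int,
                delta: float) -> float:
    """Illustrative lower confidence bound: accuracy of the sampled-data
    classifier on the sampled testing data, minus a Hoeffding variation term
    that shrinks as the testing sample grows."""
    test_var = math.sqrt(math.log(1.0 / delta) / (2.0 * test_sample_size))
    return max(0.0, test_acc_on_sample - test_var)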

Discussion. There is an implicit assumption in deriving the lower bound: for any sampled training set, the classifier trained on the full training data is at least as accurate on D_test as the classifier trained on the sample. We argue that even though this assumption is not exactly satisfied in practice, it holds closely enough to provide useful results. That is, in most cases this assumption holds, and even when it is violated, we can adjust the algorithm slightly to provide a meaningful result. Specifically, assume there exists some configuration whose sampled-data classifier outperforms its full-data classifier on D_test, and consider the following two scenarios: (a) that configuration is not selected as the best configuration by Abc; (b) it is selected by Abc.

For scenario (a), the violation will not affect the correctness of our CI-based pruning. That is because, first, the violation only increases the lower bound of that configuration and makes it harder to prune; second, any configuration pruned by it will also be pruned by the finally returned configuration.

For scenario (b), we can return the classifier trained on the sampled data instead of the one trained on the full data for the selected configuration. First, this is reasonable, because users would prefer the better classifier to the worse one. Second, since the configuration is selected by Abc, it is easy to check whether the assumption is violated by comparing the two classifiers' test accuracies. Last, the ε-guarantee still holds.

In summary, we can add a post-processing step to eliminate the effect of such an assumption violation: if the violation occurs for the selected configuration, we replace the full-data classifier with the sampled-data classifier in the final result.

4.1 Correctness of Algorithm 1

Combining Theorems 1 and 2, we can estimate the confidence interval for the real test accuracy based on each probe over the sampled training data S_train and the sampled testing data S_test. Next, Theorem 3 shows that Algorithm 1 successfully returns an approximate best configuration with high probability.

Corollary 1 (Confidence Interval).

With probability at least 1 − 2δ (by a union bound over Theorems 1 and 2), the confidence interval contains the real test accuracy, i.e., l ≤ v ≤ u.

Theorem 3 (Correctness).

With high probability, Algorithm 1 returns an approximate best configuration, i.e., one whose real test accuracy is within ε of the best configuration's.

The correctness of our algorithm is independent of the choice of the scheduler.

5 Scheduler

Now we have shown that our proposed Abc can identify the approximate best configuration with high probability. This section focuses on the optimization part of Problem 1, i.e., how to minimize the total run time. Let T_i(m) be the probing time for configuration c_i with a sampled training dataset of size m, and let R_i be the accumulated run time spent probing c_i in Algorithm 1. Also, let l_i and u_i be the lower and upper bound of c_i when the algorithm terminates. With these notations, the design of the Scheduler in Abc can be expressed as a constrained optimization problem. Without loss of generality, assume c_1 is returned by Algorithm 1.

Problem 2 (Scheduling).

Design a scheduler to minimize the sum of the accumulated probing times over all configurations plus the time to train c_1 on the full training data, subject to: every other configuration's upper bound at termination is within ε of c_1's lower bound, i.e., u_j ≤ l_1 + ε for all j ≠ 1.

The objective function in Problem 2 has two terms. The first term is the time taken to identify the approximate best configuration. Since probing dominates the run time in each iteration, we use the total time of all probes as the proxy of the identification time. The second term is the time taken to get the trained classifier corresponding to the approximate best configuration after identification, which is a constant. The constraints in Problem 2 ensure that all the configurations are pruned except , and are necessary for the termination of Algorithm 1.

To solve Problem 2, we begin by studying the properties of the 'oracle' optimal scheduling scheme, which has access to the lower and upper bounds as functions of the training sample size after the samples are drawn. We claim that the optimal scheduling scheme with this oracle access probes each configuration exactly once, since otherwise we could always reduce the total run time by keeping only the last probe. After removing the constant term, the objective function can be rewritten as the sum over configurations of the probing time at their final sample sizes. Furthermore, by applying the method of Lagrange multipliers, we obtain the conditions the optimal solution must satisfy:

(9)

Now, since we do not have oracle access to these functions, there is no closed-form formula for the optimal sample size of each configuration. To address this challenge, we propose a scheduling scheme GradientCI with two parts.

First, we use the gradient of the run time with respect to the confidence interval to determine the configuration to probe next. We depict this strategy in Algorithm 2. GradientCI first sorts the remaining configuration set in descending order of the upper bound (line 3), making the configuration with the largest upper bound its guess for the best configuration. Next, it compares the per-time growth rate of that configuration's lower bound with the per-time decrease rate of the runner-up's upper bound: if the latter is smaller, the configuration with the largest upper bound is picked for the next probe (line 4); otherwise, the configuration with the second largest upper bound is picked (line 5). Here, the run time difference between the two most recent consecutive probes of a configuration serves as a proxy for the cost of its next probe. The choice between the guessed best configuration and the others is based on the first condition in Equation (9): intuitively, if the lower bound of the guessed best grows faster (per time spent) than all the other configurations' upper bounds decrease, then we opt to probe it. The choice among the remaining configurations is based on the second condition in Equation (9), driving them toward the same upper bound.

Second, we design the sample size sequence within each configuration. As shown in line 6 of Algorithm 2, we utilize a common trick called geometric scheduling, which was used in prior work to increase the sample size for a single configuration [Provost, Jensen, and Oates1999]. We further derive the closed form for the optimal step size when the probing time is a power function of the sample size, i.e., T(m) = a·m^b for some real number b: the optimal step size follows β = 2^(1/b). Details can be found in the Appendix.

1  Input: remaining configuration set C_remain;
2  Output: configuration c_next for the next probe and its sample size;
3  (c_(1), ..., c_(k)) ← sort_by_upper_bound(C_remain);   // descending upper bound; c_(1) is the guessed best
4  if the per-time decrease of u_(2) is smaller than the per-time increase of l_(1) then c_next ← c_(1);
5  else c_next ← c_(2);
6  m_{c_next} ← β · m_{c_next};   // geometric increase of the sample size
7  return (c_next, m_{c_next});
Algorithm 2 SchedulerGradientCI
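The following Python sketch mirrors Algorithm 2; the bookkeeping fields named in the docstring (per-probe bound changes and run-time differences) are assumptions of this sketch rather than the paper's notation.

def gradient_ci_scheduler(stats, beta=2.0):
    """Illustrative sketch of GradientCI. `stats` maps each remaining
    configuration to a dict with keys:
      'upper', 'lower'      -- current confidence bounds
      'd_upper', 'd_lower'  -- decrease of the upper bound / increase of the
                               lower bound over the last two probes
      'd_time'              -- run-time difference between those two probes
      'sample_size'         -- sample size used in the last probe
    These fields are assumed bookkeeping, not the paper's notation."""
    # Sort remaining configurations by upper bound, descending (line 3).
    ranked = sorted(stats, key=lambda c: stats[c]['upper'], reverse=True)
    first, second = ranked[0], ranked[1]

    # Compare per-time growth of the guessed best's lower bound with the
    # per-time decrease of the runner-up's upper bound (lines 4-5).
    lower_rate = stats[first]['d_lower'] / max(stats[first]['d_time'], 1e-9)
    upper_rate = stats[second]['d_upper'] / max(stats[second]['d_time'], 1e-9)
    chosen = first if upper_rate < lower_rate else second

    # Geometric increase of the chosen configuration's sample size (line 6).
    next_size = int(beta * stats[chosen]['sample_size'])
    return chosen, next_size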

Performance Analysis for GradientCI. The analysis is difficult for arbitrary run-time functions. But when the accumulated run time, viewed as a function of the lower bound and (separately) of the upper bound, is convex, we are able to prove a 4-approximation guarantee for GradientCI with respect to the oracle optimal run time when ε = 0. Furthermore, as ε increases, the total run time tends to decrease significantly. Intuitively, the convexity condition means that increasing the sample size has a diminishing return on the change of the CI: as Algorithm 1 proceeds, it takes longer to attain the same increase of the lower bound (or decrease of the upper bound).

Theorem 4 (GradientCI 4-Approx).

If the accumulated run time, viewed as a function of the lower bound and of the upper bound, is convex, then GradientCI provides a 4-approximation guarantee to the oracle optimal run time when ε = 0, with high probability.

6 Experiments

This section evaluates the efficiency and effectiveness of our Abc module. First, we evaluate whether Abc successfully identifies a top configuration while reducing the total run time. We also compare different scheduling schemes in the Appendix.

Figure 3: Speedup Compared to the Full-run
Figure 4: Accuracy Comparison Between Full-run and Abc

Configurations. We focus on the task of classifying featurized data in our evaluation. Specifically, we choose five widely used and high-performance learners: LogisticRegression, LinearSVM, LightGBM, NeuralNetwork, and RandomForest. Each classifier is associated with various hyperparameters, e.g., the number of trees in RandomForest and the penalty coefficient in LinearSVM. In total they have 29 discrete or continuous hyperparameters. In our experiments, we use random search to generate each hyperparameter value from its corresponding domain.
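As an illustration of this random-search step, the sketch below draws one configuration at a time. The domain table and parameter names are hypothetical placeholders, not the actual 29 hyperparameters and ranges used in the experiments.

import random

# Hypothetical hyperparameter domains, for illustration only.
DOMAINS = {
    "RandomForest": {"n_trees": (50, 500), "max_depth": (3, 20)},
    "LinearSVM": {"penalty_c": (1e-3, 1e3)},
    "LightGBM": {"n_leaves": (16, 256), "learning_rate": (0.01, 0.3)},
}

def sample_configuration(rng=random):
    """Draw one (learner, hyperparameters) configuration by random search,
    sampling each hyperparameter uniformly from its domain (integer-valued
    hyperparameters would be rounded in practice)."""
    learner = rng.choice(sorted(DOMAINS))
    params = {name: rng.uniform(low, high)
              for name, (low, high) in DOMAINS[learner].items()}
    return learner, params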

Datasets. We evaluate with five large-scale machine learning benchmarks that are publicly available. As discussed in the introduction, the motivation of Abc is to handle large datasets and quickly identify the approximate best configuration. Thus, the datasets evaluated in our experiments all contain millions of records (N) and up to roughly 10K features (d). We do not use AutoML benchmarks such as HPOlib [Eggensperger et al.2013] or OpenML [Vanschoren et al.2013], which mainly contain small or medium-sized datasets (up to 50K records). The statistics of each dataset are shown in Table 1. We use min-max normalization for all datasets, and n-gram extraction as well as model-based top-K feature selection for TwitterSentiment.

Algorithms. We compare our proposed Abc with the standard approach, named Full-run. For each configuration, Full-run first trains the classifier with the full training data and then tests it on the full testing data; afterwards, it returns the configuration with the highest testing accuracy. This method is supported in mature tools like scikit-learn and Azure ML. Existing approaches to best configuration identification, such as DAUB [Sabharwal, Samulowitz, and Tesauro2016] or Successive-halving [Jamieson and Talwalkar2016], are heuristics without an accuracy guarantee. They are not an apples-to-apples comparison with our solution, as they cannot ensure an ε-approximation guarantee on accuracy. Nevertheless, we conduct a best-effort comparison with Successive-halving in Section 6.2 and the Appendix.

Setup. We conducted our evaluation on a VM with 8 cores and 56 GB RAM. The initial training sample size and testing sample size are 1000 and 2000, respectively. The geometric step size and the confidence parameter δ are fixed across all experiments; since δ only enters the bounds through a logarithmic term, the result is not sensitive to it. We also conduct experiments with varying ε, as shown in the Appendix.

We use the same set of sampled configurations for both Full-run and Abc. We vary the number of input configurations n from 5 to 80. Since we focus on large datasets, it already takes half a day to a day to finish Full-run with 80 configurations for a single dataset. So unlike the case of small datasets, 80-100 is a realistic number, because that is how many configurations a user can try with Full-run within a reasonable time.

6.1 Abc vs. Full-run

We compare Abc against Full-run from two perspectives, run time and accuracy. We first compute the speedup achieved by Abc, where speedup is defined as the ratio between Full-run’s total run time and ours. Next, we compare the configuration returned by our Abc with the best configuration provided by Full-run in terms of real test accuracy.

Efficiency Comparison. In practice, Abc is used in two scenarios. During exploration, users want to try a few configurations (e.g., to verify the usefulness of a few new features) as an intermediate step. The identification result decides the follow-up trials, but it does not serve as the final configuration, and so it does not require full training. At the end of the exploration, users need the trained classifier corresponding to the top configuration. Thus, we evaluate the run time speedup in these two scenarios: (a) we first compare the identification time between Abc and Full-run, as depicted in Figure 3(a); (b) we then compare the total run time, including the time to train the final classifier, in Figure 3(b). Our solution is on average 190× faster than Full-run in scenario (a), and on average 60× faster in scenario (b). Furthermore, 23 out of 25 experiments (i.e., 5 datasets times 5 configuration set sizes) achieve at least an n-fold speedup in scenario (a), and 22 out of 25 achieve at least an n-fold speedup in scenario (b). This means that in most cases the run time of Abc is smaller than fully evaluating even one configuration on average, which further means that even a perfectly distributed Full-run cannot beat the non-distributed Abc.

The speedup on dataset TwitterSentiment is consistently lower than other datasets. This is mainly because TwitterSentiment is one order of magnitude smaller than the other datasets. With the same sample size, the sampling ratio is higher than the other datasets, which causes lower speedup.

Effectiveness Comparison. As illustrated in Figure 4, Abc successfully identifies a configuration whose real test accuracy is within 0.01 of the best configuration's real test accuracy in all of our experiments. In particular, when n is 40 or 80, Abc identifies the exact best configuration for FlightDelay, NYCTaxi, and HIGGS. The largest deviation is around 0.0068, which occurs when the number of configurations is 20 on HIGGS.

Takeaway. Compared to Full-run, our proposed Abc successfully identifies a competitive or identical best configuration in much less time.

6.2 CI-based pruning vs. Successive-halving

Next, we compare our proposed CI-based pruning with Successive-halving. Successive-halving was proposed as a pruning strategy for evaluating iterative training configurations under a resource budget on the total number of training iterations across all configurations. We modify it to use the total sample size as the resource budget. In each round, it trains a classifier on the sampled data for each remaining configuration, and then eliminates the worse-performing half of the configurations. It repeats until only one configuration remains (see the sketch below).
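One concrete way to realize this sample-size-budgeted variant is sketched below; the train_and_score callable and the per-round geometric growth are assumptions of the sketch (in scenario (a) the experiments instead pair it with the same sample size sequence as Abc).

def successive_halving(configs, train_and_score, init_size=1000, growth=2):
    """Sketch of the Successive-halving baseline adapted to use training
    sample size as the resource: every round the per-configuration sample
    size grows geometrically and the worse-scoring half is eliminated.
    `train_and_score(config, sample_size)` is a hypothetical callable that
    trains on that many sampled records and returns test accuracy on a
    sampled test set."""
    remaining = list(configs)
    size = init_size
    while len(remaining) > 1:
        scores = {c: train_and_score(c, size) for c in remaining}
        remaining.sort(key=lambda c: scores[c], reverse=True)   # rank by current test accuracy
        remaining = remaining[: max(1, len(remaining) // 2)]    # keep the better half
        size *= growth                                          # give survivors more samples
    return remaining[0]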

Since the two solutions are designed to satisfy different constraints (accuracy loss vs. resource budget), they are not directly comparable. We do our best to evaluate them in two scenarios: (a) with no resource constraint, and (b) with a resource constraint. We introduce a metric, called relative accuracy loss, to measure the difference in test accuracy between the returned configuration and the best configuration; the smaller it is, the better.
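One natural way to write this metric, assuming the loss is normalized by the best configuration's real test accuracy, is

\[
  \mathcal{L} \;=\; \frac{v_{c^*} - v_{\hat{c}}}{v_{c^*}},
\]

where c^* is the best configuration under Full-run, \hat{c} is the configuration returned by the evaluated method, and v denotes real test accuracy.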

Figure 5: CI-based pruning vs. Successive-halving in Scenario (a). (a) Speedup vs. relative accuracy loss; (b) boxplot of relative accuracy loss.

Scenario (a): no resource constraint. We run Successive-halving with the identical sample size sequence as Abc, to compare CI-based pruning against the point-estimate-based halving strategy. We perform the same set of experiments as in the main experiments for Successive-halving, and depict the comparison in Figure 5. The x-axis in Figure 5(a) is the relative accuracy loss compared to the best configuration found by Full-run, the y-axis is the speedup over the run time of Full-run, and each point corresponds to a specific experiment with a certain dataset and configuration set size n. We can see that Successive-halving has a speedup over Full-run similar to Abc's. However, its relative accuracy loss can be an order of magnitude larger than that of Abc. This is because the pruning performed in Successive-halving is based on the ranking of the current test accuracy; Abc, in contrast, uses confidence intervals of the real test accuracy to perform safer pruning. Figure 5(b) presents a boxplot summarizing the relative accuracy loss of the two methods. On average, the relative accuracy loss of our CI-based solution is well below 1% (as is every individual run), while Successive-halving averages 2% (up to 8%), nearly ten times larger.

We discuss the results for scenario (b) in the Appendix.

7 Related Work

AutoML. AutoML has gained increasing attention in the past few years. Its scope includes automated feature engineering, model selection, and hyperparameter tuning. Prevailing AutoML tools include Auto-sklearn for Python [Feurer et al.2015] and Auto-Weka for Java [Thornton et al.2013]. Most research effort is devoted to the search strategy, i.e., which configurations to evaluate. The strategies can be broadly categorized as grid search [Pedregosa et al.2011], random search [Bergstra and Bengio2012], spectral search [Hazan, Klivans, and Yuan2018], Bayesian optimization [Hutter, Hoos, and Leyton-Brown2011, Eggensperger et al.2013, Bergstra et al.2011, Snoek, Larochelle, and Adams2012], meta-learning [Feurer et al.2015], and genetic programming [Olson et al.2016]. Few studies address the efficiency of ranking these configurations on large datasets. TuPAQ [Sparks et al.2015] and HyperDrive [Rasley et al.2017] are two systems that focus on hyperparameter tuning when all the configurations correspond to iterative training processes. They distribute the configurations across multiple machines and use heuristic early-stopping rules for training iterations. [Jamieson and Talwalkar2016] further models this problem as a non-stochastic multi-armed bandit process, where each arm corresponds to a configuration, each pull corresponds to a few training iterations, and the reward is the intermediate accuracy on the test data. Recognizing the difference from the stochastic bandit process, they propose a Successive-halving pruning strategy in the fixed-budget setting. They focus on this setting because they found it difficult to derive confidence bounds on the real test accuracy based on limited training iterations. Hyperband [Li et al.2017] uses Successive-halving as a building block and varies the number of random configurations under the same budget. While Hyperband suggests that the notion of resource can be generalized from training iterations to the sample size of the training data, we note that it is now possible to derive confidence bounds on the real test accuracy based on sampled training data; therefore, our Abc can replace Successive-halving in this scenario to achieve lower accuracy loss. In the Bayesian optimization framework, RoBO [Klein et al.2017] treats the sample size as a hyperparameter, uses random sample sizes to evaluate each configuration, and uses a kernel function to extrapolate the real test accuracy.

Generalization Error Bounds. Generalization error bounds have been studied extensively [Zhou2002, Koltchinskii et al.2000, Bousquet and Elisseeff2002]; among them, the VC-bound [Vapnik1999] is a well-known technique for bounding the generalization error. The main idea behind VC-bounds is to use the VC-dimension to characterize the complexity of the hypothesis class. Besides the VC-dimension, other techniques for deriving generalization error bounds include covering numbers [Zhou2002], Rademacher complexity [Koltchinskii et al.2000], and stability bounds [Bousquet and Elisseeff2002]. While generalization error bounds differ from the confidence bounds needed for Abc, they have been used in other work to guide progressive sampling for a single configuration [Elomaa and Kääriäinen2002].

8 Discussion and Conclusion

We studied the problem of efficiently finding approximate best configuration among a given set of training configurations for a large dataset. Our CI-based progressive sampling and pruning solution Abc can successfully identify a top configuration with small or no accuracy loss, in much less time than the exact approach. The CI-based pruning is more robust than pruning based on point estimates.

There are multiple use cases that can benefit from our proposed Abc. The input configurations of Abc can either be specified by users based on their domain knowledge or be generated by an AutoML search algorithm. Our Abc module can help data scientists identify a top configuration faster. As they iteratively refine it, they can use Abc to verify whether altering part of the configuration (such as changing features) boosts performance, by invoking Abc with the old and new configurations. In addition, our confidence bounds can potentially be used to accelerate Bayesian optimization and spectral search on large datasets, which is interesting future work.

References

  • [Bergstra and Bengio2012] Bergstra, J., and Bengio, Y. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research 13(Feb):281–305.
  • [Bergstra et al.2011] Bergstra, J. S.; Bardenet, R.; Bengio, Y.; and Kégl, B. 2011. Algorithms for hyper-parameter optimization. In Advances in neural information processing systems, 2546–2554.
  • [Bousquet and Elisseeff2002] Bousquet, O., and Elisseeff, A. 2002. Stability and generalization. Journal of machine learning research 2(Mar):499–526.
  • [Eggensperger et al.2013] Eggensperger, K.; Feurer, M.; Hutter, F.; Bergstra, J.; Snoek, J.; Hoos, H.; and Leyton-Brown, K. 2013. Towards an empirical foundation for assessing bayesian optimization of hyperparameters. In NIPS workshop on Bayesian Optimization in Theory and Practice.
  • [Elomaa and Kääriäinen2002] Elomaa, T., and Kääriäinen, M. 2002. Progressive rademacher sampling. In AAAI’02, 140–145.
  • [Feurer et al.2015] Feurer, M.; Klein, A.; Eggensperger, K.; Springenberg, J.; Blum, M.; and Hutter, F. 2015. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, 2962–2970.
  • [Hazan, Klivans, and Yuan2018] Hazan, E.; Klivans, A.; and Yuan, Y. 2018. Hyperparameter optimization: A spectral approach. In ICLR’18.
  • [Hutter, Hoos, and Leyton-Brown2011] Hutter, F.; Hoos, H. H.; and Leyton-Brown, K. 2011. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, 507–523. Springer.
  • [Jamieson and Talwalkar2016] Jamieson, K., and Talwalkar, A. 2016. Non-stochastic best arm identification and hyperparameter optimization. In Artificial Intelligence and Statistics, 240–248.
  • [Klein et al.2017] Klein, A.; Falkner, S.; Mansur, N.; and Hutter, F. 2017. Robo: A flexible and robust bayesian optimization framework in python. In NIPS 2017 Bayesian Optimization Workshop.
  • [Koltchinskii et al.2000] Koltchinskii, V.; Abdallah, C. T.; Ariola, M.; Dorato, P.; and Panchenko, D. 2000. Improved sample complexity estimates for statistical learning control of uncertain systems. IEEE Transactions on Automatic Control 45(12):2383–2388.
  • [Li et al.2017] Li, L.; Jamieson, K.; DeSalvo, G.; Rostamizadeh, A.; and Talwalkar, A. 2017. Hyperband: A novel bandit-based approach to hyperparameter optimization. In ICLR’17.
  • [Mund2015] Mund, S. 2015. Microsoft azure machine learning. Packt Publishing Ltd.
  • [Olson et al.2016] Olson, R. S.; Bartley, N.; Urbanowicz, R. J.; and Moore, J. H. 2016. Evaluation of a tree-based pipeline optimization tool for automating data science. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, GECCO'16.
  • [Pedregosa et al.2011] Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine learning in python. JMLR 12:2825–2830.
  • [Provost, Jensen, and Oates1999] Provost, F.; Jensen, D.; and Oates, T. 1999. Efficient progressive sampling. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, 23–32. ACM.
  • [Rasley et al.2017] Rasley, J.; He, Y.; Yan, F.; Ruwase, O.; and Fonseca, R. 2017. Hyperdrive: Exploring hyperparameters with pop scheduling. In Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference, 1–13. ACM.
  • [Sabharwal, Samulowitz, and Tesauro2016] Sabharwal, A.; Samulowitz, H.; and Tesauro, G. 2016. Selecting near-optimal learners via incremental data allocation. In AAAI’16.
  • [Snoek, Larochelle, and Adams2012] Snoek, J.; Larochelle, H.; and Adams, R. P. 2012. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, 2951–2959.
  • [Sparks et al.2015] Sparks, E. R.; Talwalkar, A.; Haas, D.; Franklin, M. J.; Jordan, M. I.; and Kraska, T. 2015. Automating model search for large scale machine learning. In Proceedings of the Sixth ACM Symposium on Cloud Computing, 368–380. ACM.
  • [Thornton et al.2013] Thornton, C.; Hutter, F.; Hoos, H. H.; and Leyton-Brown, K. 2013. Auto-weka: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, 847–855.
  • [Vanschoren et al.2013] Vanschoren, J.; van Rijn, J. N.; Bischl, B.; and Torgo, L. 2013. Openml: Networked science in machine learning. SIGKDD Explorations 15(2):49–60.
  • [Vapnik1999] Vapnik, V. N. 1999. An overview of statistical learning theory. IEEE Transactions on Neural Networks 10(5):988–999.
  • [Zhou2002] Zhou, D.-X. 2002. The covering number in learning theory. Journal of Complexity 18(3):739–767.

Appendix A Proof of Theorem 3

Proof.

Without loss of generality, we assume c_1 is the configuration returned by Algorithm 1. We denote by I the set of iterations of Algorithm 1 in which at least one configuration is pruned. We will prove that when all the confidence intervals at the iterations in I correctly bound the real test accuracy (denote this event as E), the algorithm returns a correct approximate best configuration. Under this event, we only need to consider the case where the confidence intervals do not expand over the iterations in I as the sample size increases, because otherwise we can force the confidence intervals not to expand while still correctly bounding the real test accuracy. We show that when E happens, every pruned configuration has real test accuracy at most ε above that of c_1.

Consider iteration and let be the configuration with the highest lower bound in this iteration with lower bound , and be the pruned configuration in iteration . When happens, . And according to line 9 in Algorithm 1. So the pruned configuration must satisfy . Furthermore, as Algorithm 1 proceeds to iteration , according to line 7 in Algorithm 1. Thus, by induction on the iteration number . Recall that is the configuration with the highest lower bound in the last iteration and is returned by Algorithm 1. Also, since , we have for each pruned configuration in iteration .

Next, according to Corollary 1, the derived confidence interval is correct with probability at least for any configuration . Hence, for each round , the probability of having the correct confidence interval for each configuration is at least according to the union bound. In total, we have pruned configurations across all iterations, so . Based on the union bound over , we know that event happens with probability at least . Thus, we have with probability at least . ∎

Appendix B Proof of Step Size in Geometric Scheduling

Proof.

Assume is the optimal training sample size for each configuration when Algorithm 1 terminates, that minimizes the total run time. As we do not know the optimal in advance, we try probes with progressive sample size for each configuration , denoted as , where is the sample size in the last probe of when Algorithm 1 terminates. For a fixed , we assume that the last probe of all the other configurations uses their optimal training sample size, i.e., . Thus, the termination point for must satisfy (a) , otherwise would be a smaller sample size for configuration than ; and (b) , otherwise Algorithm 1 would terminate at instead of .

Based on the above property, we first claim that with a geometric schedule (i.e., sample sizes m_0, m_0·β, m_0·β², ..., where m_0 is the sample size of the initial probe and β is the geometric step size), the accumulated run time is asymptotically equivalent to the optimal run time [Provost, Jensen, and Oates1999]. Recall that T_i(m) is the probing time with a sampled training dataset of size m for configuration c_i.

We further minimize the worst-case ratio between the accumulated run time and the optimal run time. The worst case occurs when the optimal sample size lies just above the sample size of the previous probe. By expanding each term with T(m) = a·m^b and optimizing over the step size, we can derive a closed-form solution: when the step size follows β = 2^(1/b), the worst-case ratio is minimized and is guaranteed to be at most 4 in any case. For instance, when the training time increases linearly with the sample size m, i.e., b = 1, we should set β to 2, which means we should double the sample size as the probing proceeds for each configuration. In fact, Provost et al. [Provost, Jensen, and Oates1999] set the step size to 2 heuristically and found good empirical performance; our analysis provides a theoretical justification for that heuristic. ∎
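As a sketch of the calculation behind this closed form, assume T(m) = a·m^b and geometric sample sizes m_j = m_0·β^j, and write x = β^b. With the worst case being an optimal sample size just above the previous probe, the ratio of accumulated to optimal cost behaves like

\[
  \frac{\sum_{j=0}^{k} a\,(m_0 \beta^{j})^{b}}{a\,(m_0 \beta^{k-1})^{b}}
  \;\approx\; \frac{x^{2}}{x-1},
\]

which is minimized at x = 2, i.e., β = 2^{1/b}, with minimum value 4.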

Appendix C Proof of Theorem 4

Proof.

We focus on the case where all the confidence intervals correctly bound the test accuracies, which occurs with probability at least . In this case, for each configuration, its lowest upper bound and highest lower bound in all the rounds still bound the real test accuracy correctly. So without loss of generality, we can assume that with the geometric increase of the sample size, the confidence interval of each configuration shrinks as Algorithm 1 proceeds. That is, for each configuration the upper (lower) bound is monotonically decreasing (increasing) as Algorithm 1 runs.

We illustrate our analysis with the help of Figure 6. Without loss of generality, we consider as the best configuration . Each vertical line corresponds to one configuration, with shrinking confidence intervals as Algorithm 1 proceeds. The red and blue lines with the same length depict a corresponding pair of upper and lower bound after a particular probe in Algorithm 1. Let be the optimal solution in Equation 9, as shown by the solid horizontal black line in Figure 6. Furthermore, as shown in Figure 6, let be the smallest lower bound of that is no less than , and be the largest upper bound of that is no larger than , where . In the following, we prove that with our proposed scheduling scheme GradientCI, for each configuration , , the upper confidence bound cannot cross below the red solid line in Figure 6, i.e., Algorithm 1 terminates with .

Figure 6: Analysis of GradientCI

For any , we will prove by induction that at every iteration of Algorithm 1. First, it is obvious that in the first iteration (or probe), for any . Next, suppose after the iteration of Algorithm 1, we will show that after the iteration for any and . Since one probe is performed in each iteration, we only need to prove that for the probing configuration , still holds when . Recall that in each iteration, GradientCI makes the guess that the configuration with the largest upper bound is the best configuration. In the following, we discuss two cases depending on whether that guess is right in the iteration. For notational simplicity, let be the probing configuration in the iteration.

Case 1. GradientCI has a wrong guess of the best configuration . Suppose is speculated as the best configuration. By definition, . Hence, if the probing configuration is , then we have shown that . Otherwise, the probing configuration must be the one with the second highest upper bound, according to GradientCI. In that case, since is also compared against when identifying the configuration with second highest upper bound, and by definition. Thus, .

Case 2. GradientCI has a correct guess on . If is probed, then we are done since does not change for . Otherwise, the probing configuration is the one with the second highest upper bound, according to GradientCI. In that case, we prove after the iteration by contradiction. If after the iteration, then must equal at the beginning of the iteration since we assume holds after the iteration. Next, we will show at the beginning of the iteration for all .

First, for , , and at the beginning of the iteration in Algorithm 1, must equal , since otherwise it cannot be the one with the second highest upper bound. Recall that is the set of candidate configurations so far. Second, for and , we will show by contradiction that must have been pruned when equals . Otherwise (i.e., ), it is thus pruned before the iteration since , which contradicts the fact that .

Hence, we have at the beginning of the iteration for all . Thus, , since otherwise all other configurations are pruned. As a consequence, we have , since decreases with the decrease of and increases with the decrease of according to the convexity condition. Then, based on GradientCI, should be probed, which contradicts the assumption that the probing configuration is the one with the second highest upper bound. In all, .

Combining case 1 and case 2, we have now shown that when Algorithm 1 terminates, , . Equivalently, each configuration () is probed at most one more time compared to the optimal scheme (the solid horizontal black line in Figure 6). In addition, given a configuration, each probe's run time is twice that of its previous probe, since we have set the step size accordingly. Thus, the accumulated run time of GradientCI is at most 4 times the optimal run time for any configuration , where . That is:

where is the optimal run time corresponding to the optimal scheme and is the accumulated run time for in GradientCI. In the worst case, is probed all the way up to the full training data. With , we have .

Since and , we have . ∎

Appendix D Extra Experiments

D.1 Varying ε in our CI-based Framework

Figure 7: Speedup and Accuracy Loss with Varying ε. (a) Speedup; (b) accuracy loss.

In this experiment, we compare the real accuracy loss and the speedup under varying input accuracy loss tolerance ε. As shown in Figure 7, as the input ε increases, both the speedup and the real accuracy loss increase. First, as ε increases, the pruning condition is easier to satisfy, leading to faster termination of Algorithm 1. Second, with a smaller run time (i.e., resource) spent on each configuration, the CIEstimator tends to be less accurate; consequently, the real accuracy loss typically increases with ε. Notably, as depicted in Figure 7(b), the real accuracy loss increases only slightly as ε increases. Specifically, even at the largest ε we tried, the real accuracy loss is 0.012 for TwitterSentiment and below 0.004 for the other datasets. This means that even with a large ε, Abc can still return a configuration with performance competitive with the best configuration.

D.2 CI-based pruning vs. Successive-halving

Scenario (b): varying time constraint

Figure 8: CI-based pruning vs. Successive-halving in Scenario (b). Panels: (a) FlightDelay, (b) NYCTaxi, (c) HEPMASS, (d) HIGGS.

In addition to scenario (a) with no resource constraint, we also study the performance of our CI-based framework and Successive-halving, when we impose resource constraint on these two algorithms. In general, our CI-based framework dominates Successive-halving with varying resources, i.e., with the same resource, our framework returns the configuration with higher testing accuracy than that provided by Successive-halving.

For Successive-halving, the resource budget is controlled by the initial sample size. We vary the initial training sample size in Successive-halving, starting from 250 and doubling it each time; the initial test sample size is always twice the initial training sample size. Each point on the red line in Figure 8 corresponds to one initial sample size, and the leftmost point has 250 initial training samples. In this way we obtain various run times for Successive-halving, which we normalize as percentages of Full-run's run time. For our CI-based framework, we add the option to terminate at any iteration: in Algorithm 1, we output a best-guess configuration at the end of each iteration. Specifically, we compare the configuration with the highest lower bound so far against the configuration with the highest upper bound so far, and output the one with the smaller gap between its own lower bound and all other configurations' upper bounds. The intuition is that, in order to prune all other configurations, we need to compare the lower bound of the output configuration with the upper bounds of all other configurations, and we would like this gap to be as small as possible so that the accuracy loss between the output configuration and the best configuration is small (see the sketch below). Each point on the blue line in Figure 8 corresponds to one such best-guess configuration.
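In code, this best-guess rule could look like the following sketch; the bounds mapping from remaining configurations to (lower, upper) pairs is an assumed bookkeeping structure.

def best_guess(bounds):
    """Sketch of the early-termination output rule: among the configuration
    with the highest lower bound and the one with the highest upper bound,
    return whichever has the smaller gap between its own lower bound and the
    maximum upper bound of the other configurations."""
    if len(bounds) == 1:
        return next(iter(bounds))
    by_lower = max(bounds, key=lambda c: bounds[c][0])
    by_upper = max(bounds, key=lambda c: bounds[c][1])

    def gap(c):
        # Largest upper bound among the other configurations, minus c's lower bound.
        return max(bounds[o][1] for o in bounds if o != c) - bounds[c][0]

    return min({by_lower, by_upper}, key=gap)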

In Figure 8, the x-axis is the run time as a percentage of Full-run's, and the y-axis is the real test accuracy of the returned configuration. From Figure 8, we can see that our CI-based framework dominates Successive-halving. In particular, the test accuracy provided by Successive-halving is much worse than that returned by our framework in Figure 8(a)(b), while in Figure 8(c)(d) Successive-halving takes much more time to reach the same test accuracy as our framework. In general, when starting with a larger initial training sample size, Successive-halving can return a better configuration, because its point estimates are more accurate with larger training samples. However, this in turn increases the total run time of Successive-halving, making it inferior to our CI-based framework. The results suggest that Abc is also useful in the resource-constrained scenario, even though it was not designed for it.

D.3 Comparison of Scheduling Schemes

Figure 9: Comparison of Different Scheduling Schemes

Within the CI-based framework Abc, we now empirically study the impact of three different scheduling schemes, i.e., GradientCI, Ucb, and RoundRobin.

Ucb is a widely used scheduling scheme for multi-armed bandit problems. As indicated by the name, in each iteration Ucb always picks the configuration with the highest upper confidence bound for probing. The intuition is that the upper bound reflects the potential of a configuration, so the configuration with the higher upper bound deserves more exploration. To some extent, Ucb pushes the upper bounds of all configurations toward the same value, which roughly matches the second condition in Equation (9). However, Ucb does not take the lower bound's growth rate and the upper bound's decrease rate into consideration, i.e., the first condition in Equation (9).

RoundRobin allocates resources (i.e., probes) evenly among the non-pruned configurations. Specifically, RoundRobin chooses the configuration with the smallest number of probes as the next configuration to probe in each iteration, replacing line 10 in Algorithm 1.

We perform the same set of experiments as in the main experiment section, but with different scheduling schemes in our CI-based framework. First, we observe that in most cases, the accuracy loss of the returned configuration remains almost the same across scheduling schemes. However, the speedup differs between GradientCI, Ucb, and RoundRobin. Figure 9 depicts the speedup and relative accuracy loss achieved by the different scheduling schemes; each point refers to a particular dataset and configuration set size n. With the same relative accuracy loss, GradientCI achieves the highest speedup in most cases. We notice that RoundRobin performs much more slowly in a few cases (speedup below 5× while the other two schedulers achieve over 20×), which implies that non-adaptive scheduling can waste resources. GradientCI and Ucb are more robust; their average speedups are 190× and 128×, respectively. This shows the benefit of taking the speed of CI change into consideration during scheduling.

Note that the Ucb method evaluated in this section is an enhanced version of DAUB [Sabharwal, Samulowitz, and Tesauro2016]. Both Ucb and DAUB use the same scheduling scheme, but our Ucb variant uses our CIEstimator and CI-based pruning technique to ensure the ε-guarantee.