The goal of our work is to develop novel practical methods to enhance tractability of Data Science practice in the era of Big Data. Consider, for example, the following very common scenario: A Data Science practitioner is given a data set comprising a training set, a validation set, and a collection of classifiers in an ML toolkit, each of which may have numerous possible hyper-parameterizations. The practitioner would like to determine which classifier/parameter combination (hereafter referred to as “learner”) would yield the highest validation accuracy, after training on all examples in the training set. However, the practitioner may have quite limited domain knowledge of salient characteristics of the data, or indeed of many of the algorithms in the toolkit.
In such a scenario, the practitioner may inevitably resort to the traditional approach to finding the best learner [cf. 12], namely, brute-force training of all learners on the full training set, and selecting the one with best validation accuracy. Such an approach is acceptable if the computational cost of training all learners is not an issue. However, in the era of Big Data, this is becoming increasingly infeasible. Web-scale datasets are proliferating from sources such as Twitter, TREC, SNAP, ImageNet, and the UCI repository, particularly in domains such as vision and NLP. ImageNet datasets can exceed 100 gigabytes, and the recent “YouTube-Sports-1M” video collection exceeds 40 terabytes. Moreover, the diverse set of learners available in today’s ML packages [18, 24, 26, 22] is continually expanding, and many of the most successful recent algorithms entail very heavy training costs (e.g., Deep Learning neural nets with Dropout).
The above factors motivate a search for techniques to reduce training cost while still reliably finding a near-optimal learner. One could consider training each learner on a small subset of the training examples, and choosing the best performing one. This entails less computation, but could result in significant loss of learner accuracy, since performance on a small subset can be a misleading predictor of performance on the full dataset. As an alternative, the small-subset results could be projected forward using parameterized accuracy models to predict full training set accuracy. Creating such models is, however, a daunting task, potentially needing prior knowledge about learners and domain, characteristic features of the data, etc.
In this paper, we develop a novel formulation of what it means to solve the above dual-objective problem, and we present a novel solution approach, inspired by multi-armed bandit literature [3, 29, 27, 1]. Our method develops model-free, cost-sensitive strategies for sequentially allocating small batches of training data to selected learners, wherein “cost” reflects misallocated samples that were used to train other learners that were ultimately not selected. We express the cost in terms of the regret of the approach, comparing the algorithm’s cost with that of an oracle which only allocates data to the best learner.
Our main contributions are as follows. First, we give a precise definition of a new ML problem setting, called the Cost-Sensitive Training Data Allocation Problem. Second, we present a simple, knowledge-free, easy-to-use and practical new algorithm for this setting, called DAUB (Data Allocation with Upper Bounds). Third, we give empirical demonstrations that DAUB achieves significant savings in training time while reliably achieving optimal or near-optimal learner accuracy over multiple real-world datasets. Fourth, we provide theoretical support for DAUB in an idealization of the real-world setting, wherein DAUB can work with noiseless accuracy estimates when training on $n$ samples, in lieu of actual noisy estimates. The real-world behavior of DAUB will progressively approach the idealized behavior as $n$ becomes large. In this setting, we establish a bound on accuracy of learners selected by DAUB, a sub-linear bound on the data misallocated by DAUB, and an associated bound on the computational training cost (regret).
Traditional bandit strategies mentioned above, such as the celebrated UCB1 [3] and Thompson sampling [29, 1], presume that additional trials of a given arm yield stationary payoffs, whereas in our scenario, additional data allocations to a learner yield increasing values of its accuracy. There are also existing methods to optimize a single arbitrary function while minimizing the number of evaluations [cf. 23]. These also do not fit our setting: we are dealing with multiple unknown but well-behaved functions, and wish to rank them on estimated accuracy after training on the full dataset, based on their upper bounds computed from far fewer samples.
Somewhat related is algorithm portfolio selection, which seeks the most suitable algorithm (e.g., learner) for a given problem instance, based on knowledge from other instances and features characterizing the current instance. Note, however, that most selection algorithms use parameterized accuracy models which are fit to data [e.g., 20]. Also related is work on hyper-parameter optimization, where one searches for novel configurations of algorithms to improve performance [28, 7, 6, 5], or a combination of both. An example is Auto-Weka [30], which combines selection and parameter configuration based on Bayesian optimization [cf. 11]. Predicting generalization error on unseen data has in fact been recognized as a major ML challenge.
A recent non-frequentist approach takes a Bayesian view of multi-armed bandits, applicable especially when the number of arms exceeds the number of allowed evaluations, and applies it also to automatic selection of ML algorithms. Like some prior methods, it evaluates algorithms on a small fixed percentage (e.g., 10%) of the full dataset. Unlike the above approaches, we do not assume that training (and evaluation) on a small fixed fraction of the data reliably ranks full-training results.
Finally, Domhan et al. [13] recently proposed extrapolating learning curves to enable early termination of non-promising learners. Their method is designed specifically for neural networks and does not apply directly to many classifiers (SVMs, trees, etc.) that train non-iteratively from a single pass through the dataset. They also do not focus on theoretical justification, and they fit accuracy estimates to a library of hand-designed learning curves.
2 Cost-Sensitive Training Data Allocation
We begin by formally defining the problem of cost-sensitive training data allocation. As before, we use learner to refer to a classifier along with a hyper-parameter setting for it. Let $\mathcal{C} = \{C_1, C_2, \ldots, C_M\}$ be a set of $M$ learners which can be trained on subsets of a training set $T_r$ and evaluated on a validation set $T_v$. Let $N = |T_r|$. For $k \in \mathbb{N}$, let $[k]$ denote the set $\{1, 2, \ldots, k\}$.
For $i \in [M]$, let $c_i : [N] \to \mathbb{R}^+$ be a cost function denoting expected computational cost of training learner $C_i$ when $n$ training examples are drawn uniformly at random from $T_r$. [Footnote 1: While we define the core concepts in terms of expected values suitable for a formal definition and idealized analysis, the actual DAUB algorithm will operate on observed values of $c_i$ on particular subsets of $n$ training examples chosen at runtime.] We make two common assumptions about the training process, namely, that it looks at all training data and its complexity grows at least linearly. Formally, $c_i(n) \ge n$ and $c_i(m) \ge \frac{m}{n} c_i(n)$ for $m > n$.
For $i \in [M]$, let $f_i : [N] \to [0, 1]$ be an accuracy function where $f_i(n)$ denotes expected accuracy of $C_i$ on $T_v$ when trained on $n$ training examples chosen at random from $T_r$. The corresponding error function, $e_i(n)$, is defined as $1 - f_i(n)$. Note that our tool also supports accuracy functions not tied to a fixed validation set (e.g., cross-validation) and other measures such as precision, recall, and F1-score; our analysis applies equally well to these measures.
We denote a training data allocation of $n$ training samples to learner $i$ by a pair $a = (i, n)$. Let $S = \big((i^{(1)}, n^{(1)}), (i^{(2)}, n^{(2)}), \ldots, (i^{(s)}, n^{(s)})\big)$ be a sequence of allocations to learners in $\mathcal{C}$. We will use $S_i$ to denote the induced subsequence containing all training data allocations to learner $C_i$, i.e., the subsequence of $S$ induced by all pairs $(i^{(k)}, n^{(k)})$ such that $i^{(k)} = i$. In our context, if allocations $(i, n^{(k)})$ and $(i, n^{(\ell)})$ are in $S_i$ with $k < \ell$, then $n^{(k)} < n^{(\ell)}$.
Evaluating $f_i(n)$ amounts to training learner $C_i$ on $n$ examples from $T_r$ and evaluating its accuracy. This, in expectation, incurs a computational cost of $c_i(n)$. In general, the expected training complexity or cost associated with $\mathcal{C}$ under the data allocation sequence $S$ is $\mathrm{cost}(S) = \sum_{(i, n) \in S} c_i(n)$.
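Our goal is to search for an $\tilde{i} \in [M]$ such that $f_{\tilde{i}}(N)$ is maximized, while also ensuring that overall training cost is not too large relative to $c_{\tilde{i}}(N)$. This bi-objective criterion is not easy to achieve. E.g., a brute-force evaluation, corresponding to $S = \big((1, N), (2, N), \ldots, (M, N)\big)$ and $\tilde{i} = \arg\max_{i \in [M]} f_i(N)$, obtains the optimal $\tilde{i}$ but incurs maximum training cost of $c_i(N)$ for all suboptimal learners. On the other hand, a low-cost heuristic $S = \big((1, n), (2, n), \ldots, (M, n), (\tilde{i}, N)\big)$ for some $n \ll N$ and $\tilde{i} = \arg\max_{i \in [M]} f_i(n)$ incurs a small training overhead of only $c_i(n)$ for each suboptimal $C_i$, but may choose an arbitrarily suboptimal $\tilde{i}$.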
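To make the two extremes concrete, the following minimal sketch computes the cost of an allocation sequence as the sum of per-allocation training costs and compares brute force against the small-subset heuristic. The three cost functions, the sizes `N` and `n_small`, and the assumption that the heuristic picks learner 0 are all hypothetical choices made for illustration; learners are indexed from 0 here.

```python
import math
from typing import Callable, List, Tuple

Allocation = Tuple[int, int]  # (learner index i, number of training examples n)


def total_cost(seq: List[Allocation], costs: List[Callable[[int], float]]) -> float:
    """cost(S): the sum of c_i(n) over every allocation (i, n) in the sequence S."""
    return sum(costs[i](n) for i, n in seq)


# Hypothetical cost functions; each satisfies c(n) >= n and grows at least linearly.
costs = [
    lambda n: float(n),              # linear-time learner
    lambda n: n * math.log2(n + 1),  # n log n learner
    lambda n: float(n) ** 1.5,       # super-linear learner
]
M, N, n_small = len(costs), 10_000, 500

# Brute force: train every learner on all N examples.
brute_force = [(i, N) for i in range(M)]

# Small-subset heuristic: train every learner on n_small examples, then train
# the apparent winner (assumed here to be learner 0) on all N examples.
chosen = 0
small_subset = [(i, n_small) for i in range(M)] + [(chosen, N)]

cost_brute = total_cost(brute_force, costs)
cost_small = total_cost(small_subset, costs)
print(f"brute force: {cost_brute:,.0f}   small subset: {cost_small:,.0f}")
```

The heuristic is far cheaper, but nothing in this computation says whether learner 0 is actually the best learner at size `N`; that gap is what the problem formulation addresses.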
We seek an in-between solution, ideally with the best of both worlds: a bounded optimality gap on $C_{\tilde{i}}$’s accuracy, and a bounded regret in terms of data misallocated to sufficiently suboptimal learners. Informally speaking, we will ensure that learners with performance at least $\Delta$ worse than optimal are allocated only $o(N)$ training examples, i.e., an asymptotically vanishing fraction of $T_r$. Under certain conditions, this will ensure that the training cost regret is sublinear. We next formally define the notions of suboptimality and regret in this context.
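Definition 1.
Let $\mathcal{C}$ be a collection of $M$ learners with accuracy functions $f_i$, $\Delta \in (0, 1]$, $n \in \mathbb{N}^+$, and $i^* = \arg\max_{i \in [M]} f_i(n)$. A learner $C_j \in \mathcal{C}$ is called $(n, \Delta)$-suboptimal for $\mathcal{C}$ if $f_{i^*}(n) - f_j(n) \ge \Delta$, and $(n, \Delta)$-optimal otherwise.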
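Definition 2.
Let $S$ be a data allocation sequence for a collection $\mathcal{C}$ of learners with accuracy functions $f_i$, $\Delta \in (0, 1]$, and $n \in \mathbb{N}^+$. The $(n, \Delta)$-regret of $S$ for $\mathcal{C}$ is defined as:
$$\sum_{i \,:\, C_i \text{ is } (n, \Delta)\text{-suboptimal}} \mathrm{cost}(S_i)$$
The regret of $S$ is thus the cumulative cost of training all $(n, \Delta)$-suboptimal learners when using $S$.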
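As an illustration of these two definitions, the sketch below flags the learners whose accuracy is at least `delta` below the best one and sums the training cost of every allocation made to them. The accuracy values, the linear cost functions, and the allocation sequence are invented for the example, and learners are indexed from 0.

```python
from typing import Callable, List, Tuple

Allocation = Tuple[int, int]  # (learner index i, number of training examples n)


def suboptimal(acc_at_n: List[float], delta: float) -> List[int]:
    """Indices j whose accuracy is at least delta below the best accuracy."""
    best = max(acc_at_n)
    return [j for j, acc in enumerate(acc_at_n) if best - acc >= delta]


def regret(seq: List[Allocation], costs: List[Callable[[int], float]],
           acc_at_n: List[float], delta: float) -> float:
    """Cumulative training cost of all allocations made to suboptimal learners."""
    bad = set(suboptimal(acc_at_n, delta))
    return sum(costs[i](n) for i, n in seq if i in bad)


# Hypothetical accuracies after full training, and linear training costs.
acc_at_N = [0.90, 0.88, 0.70]
costs = [lambda n: float(n)] * 3
S = [(0, 100), (1, 100), (2, 100), (2, 200), (0, 1000)]

print(suboptimal(acc_at_N, 0.05))
print(regret(S, costs, acc_at_N, 0.05))
```

With `delta = 0.05`, learner 1 counts as optimal even though it is not the best learner, so only the two allocations to learner 2 contribute to the regret.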
Definition 3 (Cost-Sensitive Training Data Allocation Problem).
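Let $\mathcal{C} = \{C_1, \ldots, C_M\}$ be a set of learners, $T_r$ be a training set for $\mathcal{C}$ containing $N$ examples, $T_v$ be a validation set, $c_i$ and $f_i$, for $i \in [M]$, be the training cost and accuracy functions, resp., for learner $C_i$, and $\Delta \in (0, 1]$ be a constant. The Cost-Sensitive Training Data Allocation Problem is to compute a training data allocation sequence $S$ for $\mathcal{C}$ and $T_r$ as well as a value $\tilde{i} \in [M]$ such that:
1. $S$ contains $(\tilde{i}, N)$,
2. $C_{\tilde{i}}$ is $(N, \Delta)$-optimal,
3. $\mathrm{cost}(S_{\tilde{i}}) \le d \cdot c_{\tilde{i}}(N)$ for some fixed constant $d$, and
4. the $(N, \Delta)$-regret of $S$ is $o(M \cdot \mathrm{cost}(S_{\tilde{i}}))$ in $N$.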
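A solution to this problem thus identifies an $(N, \Delta)$-optimal learner $C_{\tilde{i}}$, trained on all of $T_r$, incurring on $C_{\tilde{i}}$ no more than a constant factor overhead relative to the minimum training cost of $c_{\tilde{i}}(N)$, and with a guarantee that any $(N, \Delta)$-suboptimal learner $C_i$ incurred a vanishingly small training cost compared to training $C_{\tilde{i}}$ (specifically, $\mathrm{cost}(S_i) / \mathrm{cost}(S_{\tilde{i}}) \to 0$ as $N \to \infty$).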
3 The DAUB Algorithm
Algorithm 1 describes our Data Allocation using Upper Bounds strategy. The basic idea is to project an optimistic upper bound on full-training accuracy $f_i(N)$ of learner $i$ using recent evaluations $f_i(n)$. The learner with highest upper bound is then selected to receive additional samples. Our implementation of DAUB uses monotone regression to estimate upper bounds on $f_i(N)$ as detailed below, since observed accuracies are noisy and may occasionally violate known monotonicity of learning curves. In the noise-free setting, by contrast, a straight line through the two most recent values of $f_i(n)$ provides a strict upper bound on $f_i(N)$.
As a bootstrapping step, DAUB first allocates $b$ and $br$ training examples to each learner $C_i$, trains them, and records their training and validation accuracy in arrays $f_i^T$ and $f_i^V$, resp. If $f_i^V$ at the current point is smaller than at the previous point, DAUB uses a simple monotone regression method, making the two values meet in the middle.
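The "meet in the middle" repair can be sketched as follows. This is one plausible reading of that sentence, not the authors' code: when the newest validation accuracy is lower than the previous one, both entries are replaced by their average. The function name `meet_in_middle` is hypothetical.

```python
from typing import List


def meet_in_middle(values: List[float]) -> List[float]:
    """Return a copy in which a drop between the last two entries is smoothed
    by replacing both entries with their mean (assumed interpretation)."""
    out = list(values)
    if len(out) >= 2 and out[-1] < out[-2]:
        mid = (out[-1] + out[-2]) / 2.0
        out[-2] = out[-1] = mid
    return out


print(meet_in_middle([0.25, 0.75, 0.5]))   # last two values are averaged
print(meet_in_middle([0.25, 0.5, 0.75]))   # already monotone, left unchanged
```

Note that this local repair only touches the two most recent points; it does not by itself guarantee monotonicity with respect to earlier entries.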
After bootstrapping, in each iteration, it identifies a learner $C_j$ that has the most promising upper bound estimate (computed as discussed next) on the unknown projected expected accuracy $f_j(N)$ and allocates $r$ times more examples (up to $N$) to it than what was allocated previously. For computing the upper bound estimate, DAUB uses two sources. First, assuming training and validation data come from the same distribution, $f_i^T[n_i]$ provides such an estimate. Further, as will be justified in the analysis of the idealized scenario called DAUB*, $f_i^V[n_i] + (N - n_i) \, f_i'(n_i)$ also provides such an estimate under certain conditions, where $f_i'(n_i)$ is the estimated derivative computed as the slope of the linear regression best fit line through $f_i^V[n]$ for $n \in \{n_i / r^2, n_i / r, n_i\}$. Once some learner $C_{\tilde{i}}$ is allocated all $N$ training examples, DAUB halts and outputs $C_{\tilde{i}}$ along with the allocation sequence it used.
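The loop described above can be sketched in a few dozen lines. This is an illustrative reading of the prose, not the authors' Algorithm 1: the `evaluate(i, n)` callback, the rounding of allocation sizes with `math.ceil`, the `window` parameter for the regression, and the toy learning curves in `CURVES` are all assumptions introduced here.

```python
import math


def slope(xs, ys):
    """Least-squares slope of ys against xs (0.0 if xs has no spread)."""
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx


def daub(evaluate, M, N, b=500, r=1.5, window=3):
    """evaluate(i, n) must return (train_acc, val_acc) for learner i on n examples."""
    sizes = [[] for _ in range(M)]
    f_tr = [[] for _ in range(M)]
    f_v = [[] for _ in range(M)]
    seq = []

    def allocate(i, n):
        n = min(n, N)
        train_acc, val_acc = evaluate(i, n)
        if f_v[i] and val_acc < f_v[i][-1]:
            # Monotone repair: make the two latest values meet in the middle.
            val_acc = (val_acc + f_v[i][-1]) / 2.0
            f_v[i][-1] = val_acc
        sizes[i].append(n)
        f_tr[i].append(train_acc)
        f_v[i].append(val_acc)
        seq.append((i, n))

    def upper_bound(i):
        xs, ys = sizes[i][-window:], f_v[i][-window:]
        projected = ys[-1] + (N - xs[-1]) * slope(xs, ys)
        return min(f_tr[i][-1], projected)

    for i in range(M):  # bootstrap with b and b*r examples per learner
        allocate(i, b)
        allocate(i, math.ceil(b * r))

    while True:
        finished = [i for i in range(M) if sizes[i][-1] >= N]
        if finished:
            return finished[0], seq
        j = max(range(M), key=upper_bound)  # most promising learner
        allocate(j, math.ceil(r * sizes[j][-1]))


# Toy noiseless curves: val = a - c/sqrt(n), train = min(1, a + c/sqrt(n)).
CURVES = [(0.85, 4.0), (0.72, 0.2), (0.60, 0.1)]


def toy_evaluate(i, n):
    a, c = CURVES[i]
    return min(1.0, a + c / math.sqrt(n)), a - c / math.sqrt(n)


best, seq = daub(toy_evaluate, M=3, N=20_000)
print(best, [n for i, n in seq if i == best])
```

In this toy setup learner 0 starts out behind learner 1 at 500 examples but has the higher asymptote; the upper bounds keep it in play, while the other two learners never receive more than the bootstrap allocations.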
3.1 Theoretical Support for DAUB
To help understand the behavior of DAUB, we consider an idealized variant, DAUB*, that operates precisely like DAUB but has access to the true expected accuracy and cost functions, $f_i(n)$ and $c_i(n)$, not just their observed estimates. As $n$ grows, learning variance (across random batches of size $n$) decreases, observed estimates of $f_i(n)$ and $c_i(n)$ converge to these ideal values, and the behavior of DAUB thus approaches that of DAUB*.
Let $f^* = \max_{i \in [M]} f_i(N)$ be the (unknown) target accuracy and $C_{i^*}$ be the corresponding (unknown) optimal learner. For each $C_i$, let $u_i : [N] \to [0, 1]$ be an arbitrary projected upper bound estimate that DAUB* uses for $f_i(N)$ when it has allocated $n < N$ training examples to $C_i$. We will assume w.l.o.g. that $u_i$ is non-increasing at the points where it is evaluated by DAUB*. [Footnote 2: Since DAUB* evaluates $u_i$ for increasing values of $n$, it is easy to enforce monotonicity.] For the initial part of the analysis, we will think of $u_i$ as a black-box function, ignoring how it is computed. It may be verified that once $u_j(n)$ drops below a threshold determined by the upper bound estimates of the other learners, DAUB* will stop allocating more samples to $C_j$. While this gives insight into the behavior of DAUB*, for the analysis we will use a slightly weaker form that depends on the target accuracy $f^*$ rather than on those estimates.
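Definition 4.
$u_i$ is a valid projected upper bound function if $u_i(n) \ge f_i(N)$ for all $n \in [N]$.
Definition 5.
Define $n_i^* \in \mathbb{N}$ as $N$ if $u_i(n) \ge f^*$ for all $n \in [N]$, and as $\min\{\ell \mid u_i(\ell) < f^*\}$ otherwise.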
A key observation is that when using $u_j$ as the only source of information about $C_j$, one must allocate at least $n_j^*$ examples to $C_j$ before acquiring enough information to conclude that $C_j$ is suboptimal. Note that $n_j^*$ depends on the interaction between $u_j$ and $f^*$, and is thus unknown. Interestingly, we can show that DAUB* allocates to $C_j$ at most a constant factor more examples, specifically fewer than $r n_j^*$ in each step and $\frac{r^2}{r-1} n_j^*$ in total, if it has access to valid projected upper bound functions for $C_j$ and $C_{i^*}$ (cf. Lemma 1 in Appendix). In other words, DAUB*’s allocation is essentially optimal w.r.t. $u_j$.
A careful selection of the learner in each round is critical for allocation optimality w.r.t. $u_j$. Consider a simpler alternative: In round $k$, train all currently active learners on $n = b r^k$ examples, compute all $f_i(n)$ and $u_i(n)$, and permanently drop $C_j$ from consideration if $u_j(n) < f_i(n)$ for some $C_i$. This will not guarantee allocation optimality; any permanent decisions to drop a classifier must necessarily be conservative to be correct. By instead only temporarily suspending suboptimal looking learners, DAUB* guarantees a much stronger property: $C_j$ receives no more allocation as soon as $u_j(n)$ drops below the (unknown) target $f^*$.
The following observation connects data allocation to training cost: if DAUB* allocates at most $n$ training examples to a learner $C_i$ in each step, then its overall cost for $C_i$ is at most $\frac{r}{r-1} c_i(n)$ (cf. Lemma 2 in Appendix). Combining this with Lemma 1, we immediately obtain the following result regarding DAUB*’s regret. [Footnote 3: All proofs are deferred to the Appendix.]
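Theorem 1.
Let $\mathcal{C}, T_r, T_v, N, M, c_i$ and $f_i$ for $i \in [M]$ be as in Definition 3. Let $r > 1$, $b \in \mathbb{N}^+$, and $S$ be the allocation sequence produced by DAUB*. If the projected upper bound functions $u_j$ and $u_{i^*}$ used by DAUB* are valid, then $\mathrm{cost}(S_j) \le \frac{r}{r-1} \, c_j(r n_j^*)$.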
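The geometric-allocation argument behind such cost bounds can be checked numerically. The sketch below assumes allocation sizes $b, br, \ldots, br^k$ (without integer rounding) and two hypothetical cost functions that grow at least linearly; it compares the summed cost against $\frac{r}{r-1}$ times the cost of the largest allocation.

```python
def geometric_cost(cost, b, r, steps):
    """Total cost of training on sizes b, b*r, ..., b*r**steps, plus the last size."""
    sizes = [b * r ** k for k in range(steps + 1)]
    return sum(cost(n) for n in sizes), sizes[-1]


b, r, steps = 500, 1.5, 10
results = {}
for name, cost in [("linear", lambda n: n), ("quadratic", lambda n: n * n)]:
    total, n_last = geometric_cost(cost, b, r, steps)
    bound = r / (r - 1) * cost(n_last)
    results[name] = (total, bound)
    print(f"{name}: total={total:.3e}  bound={bound:.3e}  ratio={total / bound:.3f}")
```

For linear cost the ratio approaches 1 as the number of steps grows, while faster-growing costs leave more slack, since the last allocation then dominates the sum even more strongly.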
In the remainder of the analysis, we will (a) study the validity of the actual projected upper bound functions used by DAUB* and (b) explore conditions under which $(N, \Delta)$-suboptimality of $C_j$ guarantees that $n_j^*$ is a vanishingly small fraction of $N$, implying that DAUB* incurs a vanishingly small training cost on any $(N, \Delta)$-suboptimal learner.
3.1.1 Obtaining Valid Projected Upper Bounds
If $f_i$ for $i \in [M]$ were arbitrary functions, it would clearly be impossible to upper bound $f_i(N)$ by looking only at estimates of $f_i(n)$ for $n < N$. Fortunately, each $f_i$ is the expected accuracy of a learner and is thus expected to behave in a certain way. In order to bound DAUB*’s regret, we make two assumptions on the behavior of $f_i$. First, $f_i$ is non-decreasing, i.e., more training data does not hurt validation accuracy. Second, $f_i$ has a diminishing returns property, namely, as $n$ grows, the additional validation accuracy benefit of including more training examples diminishes. Formally:
Definition 6.
$f : \mathbb{N} \to [0, 1]$ is well-behaved if it is non-decreasing and its discrete derivative, $f'(n)$, is non-increasing in $n$.
These assumptions on expected accuracy are well-supported from the PAC theory perspective. Let $u_i(n)$ be the projected upper bound function used by DAUB* for $C_i$, namely the minimum of the training accuracy $f_i^T(n)$ of $C_i$ at $n$ and the validation accuracy based expression $f_i(n) + (N - n) \, f_i'(n)$. For DAUB*, we treat $f_i'(n)$ as the one-sided discrete derivative defined as $(f_i(n) - f_i(n - s)) / s$ for some parameter $s \in \mathbb{N}^+$. We assume the training and validation sets, $T_r$ and $T_v$, come from the same distribution, which means $f_i^T(n)$ itself is a valid projected upper bound. Further, we can show that if $f_i(n)$ is well-behaved, then $u_i(n)$ is a valid projected upper bound function (cf. Lemma 3 in Appendix).
Thus, instead of relying on a parameterized functional form to model $f_i$, DAUB* evaluates $f_i(n)$ for certain values of $n$ and computes an expression that is guaranteed to be a valid upper bound on $f_i(N)$ if $f_i$ is well-behaved.
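A small numerical sketch can illustrate why a tangent-style projection from a concave, non-decreasing curve never undershoots the final accuracy. The curves `f_val` and `f_train` below are hypothetical (a $1/\sqrt{n}$ learning curve and a capped mirror image of it), and the bound is taken as the minimum of training accuracy and validation accuracy plus the backward difference scaled by the remaining examples.

```python
import math


def discrete_derivative(f, n, s=1):
    """One-sided (backward) discrete derivative with step s."""
    return (f(n) - f(n - s)) / s


def projected_upper_bound(f_val, f_train, n, N, s=1):
    """min(training accuracy, linear projection of validation accuracy to N)."""
    projection = f_val(n) + (N - n) * discrete_derivative(f_val, n, s)
    return min(f_train(n), projection)


def f_val(n):
    # Hypothetical well-behaved curve: non-decreasing with diminishing returns.
    return 0.9 - 0.5 / math.sqrt(n)


def f_train(n):
    # Hypothetical training accuracy, always above the validation asymptote.
    return min(1.0, 0.9 + 0.5 / math.sqrt(n))


N = 2000
bounds = [projected_upper_bound(f_val, f_train, n, N) for n in range(2, N + 1)]
print(f"f(N)={f_val(N):.4f}  bound at n=2: {bounds[0]:.4f}  bound at n=N: {bounds[-1]:.4f}")
```

Early on the bound is dominated by the training-accuracy term, and as `n` approaches `N` it tightens to exactly `f_val(N)`.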
3.1.2 Bounding Regret
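We now fix $u_i$ as the projected upper bound functions and explore how $(N, \Delta)$-suboptimality and the well-behaved nature of $f_i$ together limit how large $n_i^*$ is.
Definition 7.
For $\Delta \in (0, 1]$ and a well-behaved accuracy function $f_i$, define $n_i^\Delta \in \mathbb{N}$ as $N$ if $f_i'(n) > \Delta / N$ for all $n \in [N]$, and as $\min\{\ell \mid f_i'(\ell) \le \Delta / N\}$ otherwise.
Theorem 2.
Let $\mathcal{C}, T_r, T_v, N, M, c_i$ and $f_i$ for $i \in [M]$ be as in Definition 3. Let $r > 1$, $b \in \mathbb{N}^+$, $\Delta \in (0, 1]$, $C_j \in \mathcal{C}$ be an $(N, \Delta)$-suboptimal learner, and $S$ be the allocation sequence produced by DAUB*. If $f_j$ and $f_{i^*}$ are well-behaved, then $\mathrm{cost}(S_j) \le \frac{r}{r-1} \, c_j(r n_j^\Delta)$.
The final piece of the analysis is an asymptotic bound on $n_i^\Delta$. To this end, we observe that the derivative of any bounded, well-behaved, discrete function of $n$ behaves asymptotically as $o(1/n)$ (cf. Proposition 1 in Appendix). Applying this to $f_j$, we can prove that if $f_j'(N) \le \Delta / N$, then $n_j^\Delta$ is $o(N)$ in $N$ (cf. Lemma 5 in Appendix).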
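To see the sub-linear growth in a concrete case, the sketch below takes the threshold to be the smallest size at which the backward discrete derivative falls to `delta / N` or below, and evaluates it for a hypothetical curve $f(n) = 0.9 - 0.5/\sqrt{n}$ at several dataset sizes. Both the curve and the choice `delta = 0.05` are assumptions made for the example.

```python
import math


def f(n):
    # Hypothetical well-behaved accuracy curve.
    return 0.9 - 0.5 / math.sqrt(n)


def n_delta(f, N, delta, s=1):
    """Smallest n with backward derivative <= delta / N, or N if none exists."""
    for n in range(s + 1, N + 1):
        if (f(n) - f(n - s)) / s <= delta / N:
            return n
    return N


delta = 0.05
ratios = []
for N in (1_000, 10_000, 100_000):
    nd = n_delta(f, N, delta)
    ratios.append(nd / N)
    print(f"N={N:>7}  n_delta={nd:>6}  n_delta/N={nd / N:.3f}")
```

For this curve the derivative decays like $n^{-3/2}$, so the threshold grows roughly like $N^{2/3}$ and the printed fraction shrinks as `N` increases.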
This leads to our main result regarding DAUB*’s regret:
Theorem 3 (Sub-Linear Regret).
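Let $\mathcal{C}, T_r, T_v, N, M, c_i$ and $f_i$ for $i \in [M]$ be as in Definition 3. Let $r > 1$ and $\Delta \in (0, 1]$. Let $J = \{j \mid C_j \in \mathcal{C} \text{ is } (N, \Delta)\text{-suboptimal}\}$. For all $j \in J$, suppose $f_j$ is well-behaved and $f_j'(N) \le \Delta / N$. If DAUB* outputs $S$ as the training data allocation sequence along with a selected learner $C_{\tilde{i}}$ trained on all of $T_r$, then:
1. $\mathrm{cost}(S_{\tilde{i}}) \le \frac{r}{r-1} \, c_{\tilde{i}}(N)$;
2. the $(N, \Delta)$-regret of $S$ is $o\big(\sum_{j \in J} c_j(N)\big)$ in $N$; and
3. if $c_j(n) = O(c_{\tilde{i}}(n))$ for all $j \in J$, then the $(N, \Delta)$-regret of $S$ is $o(M \cdot \mathrm{cost}(S_{\tilde{i}}))$ in $N$.
Thus, DAUB* successfully solves the cost-sensitive training data allocation problem whenever for $j \in J$, $f_j$ is well-behaved and $c_j(n) = O(c_{\tilde{i}}(n))$, i.e., training any suboptimal learner is asymptotically not any costlier than training an optimal learner. While more refined versions of this result can be generated, the necessity of an assumption on the cost function is clear: if a suboptimal learner $C_j$ was arbitrarily costlier to train than optimal learners, then, in order to guarantee near-optimality, one must incur a significant misallocation cost training $C_j$ on some reasonable subset of $T_r$ in order to ascertain that $C_j$ is in fact suboptimal.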
3.1.3 Tightness of Bounds
The cost bound on misallocated data in Theorem 2 in terms of $n_j^\Delta$ is in fact tight (up to a constant factor) in the worst case, unless further assumptions are made about the accuracy functions. In particular, every algorithm that guarantees $(N, \Delta)$-optimality without further assumptions must, in the worst case, incur a cost of the order of $c_j(n_j^\Delta)$ for every suboptimal $C_j \in \mathcal{C}$ (cf. Theorem 5 in Appendix for a formal statement):
Theorem 4 (Lower Bound, informal statement).
Let and be a training data allocation algorithm that always outputs an -optimal learner. Then there exists an -suboptimal learner that would force to incur a misallocated training cost larger than .
4 Experiments
We first evaluate DAUB on one real-world binary classification dataset, “Higgs boson” [4], and one artificial dataset, “Parity with distractors,” to examine robustness of DAUB’s strategy across two extremely different types of data. In the latter task the class label is the parity of a (hidden) subset of binary features; the remaining features serve as distractors, with no influence on the class label. We generated 65,535 distinct examples based on 5-bit parity with 11 distractors, and randomly selected 21,500 samples each for $T_r$ and $T_v$. For the Higgs and other real-world datasets, we first randomly split the data with a 70/30 ratio, selected 38,500 samples for $T_r$ from the 70% split, and used the 30% as $T_v$. We coarsely optimized the DAUB parameters $b$ and $r$ based on the Higgs data, and kept those values fixed for all datasets. This yielded 11 possible allocation sizes: 500, 1000, 1500, 2500, 4000, 5000, 7500, 11500, 17500, 25500, 38500. [Footnote 5: Unfortunately some of our classifiers crash and/or run out of memory above 38500 samples.]
Results for HIGGS and PARITY are as follows. The accuracy loss of the ultimate classifiers selected by DAUB turned out to be quite small: DAUB selected the top classifier for HIGGS (i.e., 0.0% loss) and one of the top three classifiers for PARITY (0.3% loss). In terms of complexity reduction, Table 1 shows clear gains over “full” training of all classifiers on the full $T_r$, in both total allocated samples and total CPU training time, for both standard DAUB and a variant which does not use training set accuracy as an upper bound. Both variants reduce the allocated samples by 2x-4x for HIGGS, and by 5x for PARITY. The impact on CPU runtime is more pronounced, as many sub-optimal classifiers with supra-linear runtimes receive very small amounts of training data. As the table shows, standard DAUB reduces total training time by a factor of 25x for HIGGS, and 15x for PARITY.
Figures 1 and 2 provide additional insight into DAUB’s behavior. Figure 1 shows how validation accuracy progresses with increasing training data allocation to several classifiers on the HIGGS dataset. The plots for the most part conform to our ideal-case assumptions of increasing accuracy with diminishing slope, barring a few monotonicity glitches due to stochastic sampling noise. [Footnote 6: In our experience, most of these glitches pertain to weak classifiers and thus would not significantly affect DAUB, since DAUB mostly focuses its effort on the strongest classifiers.] We note that, while there is one optimal classifier $C_{i^*}$ (a parameterization of a Rotation Forest) with best validation accuracy after training on all of $T_r$, there are several other classifiers that outperformed $C_{i^*}$ in early training. For instance, LADTree is better than $C_{i^*}$ until 5,000 examples but then flattens out.
Figure 2 gives perspective on how DAUB distributes data allocations among the 41 classifiers when run on the HIGGS dataset. The classifiers here are sorted by decreasing validation accuracy $f_i(N)$. While DAUB manages to select $C_{i^*}$ in this case, what is equally critical is the distribution of allocated training data. The figure shows that DAUB allocates most of the training data to the top eight classifiers. Most classifiers receive 2500 or fewer samples, and only four classifiers receive more than 10k samples, with all of them within 1.5% of the optimal performance.
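| Dataset | Application Area | Full Training: Alloc. | Full Training: Time (s) | DAUB: Iter. | DAUB: Alloc. | DAUB: Time (s) | Speedup | Loss |
|---|---|---|---|---|---|---|---|---|
| Vehicle Sensing | vehicle mgmt. | 1,578k | 68,139 | 50 | 296k | 5,603 | 12x | 0.0% |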
Finally, in Table 2 we report results of DAUB on Higgs plus five other real-world benchmarks as indicated: Buzz [21]; Covertype [10]; Million Song Dataset [8]; SUSY [4]; and Vehicle-SensIT [14]. These experiments use exactly the same parameter settings as for HIGGS and PARITY. As before, the table shows a comparison in terms of allocated training samples and runtime. In addition it displays the incurred accuracy loss of DAUB’s final selected classifier. The highest loss is 1%, well within an acceptable range. The average incurred loss across all six benchmarks is 0.4% and the average speedup is 16x. Our empirical findings thus show that in practice DAUB can consistently select near-optimal classifiers at a substantially reduced computational cost when compared to full training of all classifiers.
5 Conclusion
We reiterate the potential practical impact of our original Cost-Sensitive Training Data Allocation problem formulation, and our proposed DAUB algorithm for solving this problem. In our experience, DAUB has been quite easy to use, easy to code and tune, and is highly practical in robustly finding near-optimal learners with greatly reduced CPU time across datasets drawn from a variety of real-world domains. Moreover, it does not require built-in knowledge of learners or properties of datasets, making it ideally suited for practitioners without domain knowledge of the learning algorithms or data characteristics. Furthermore, all intermediate results can be used to interactively inform the practitioner of relevant information such as progress (e.g., updated learning curves) and decisions taken (e.g., allocated data). Such a tool was introduced by Biem et al. [9] and a snapshot of it is depicted in Figure 3.
Our theoretical work on the idealized DAUB* scenario also reveals novel insights and provides important support for the real-world behavior of DAUB with noisy accuracy estimates. As dataset sizes scale, we expect that DAUB will better and better approach the idealized behavior of DAUB*, which offers strong bounds on both learner sub-optimality as well as regret due to misallocated samples.
There are many opportunities for further advances in both the theoretical and practical aspects of this work. It should be possible to develop more accurate bound estimators given noisy accuracy estimates, e.g., using monotone spline regression. Likewise, it may be possible to extend the theory to encompass noisy accuracy estimates, for example, by making use of PAC lower bounds on generalization error to establish upper bounds on learner accuracy. DAUB could be further combined in an interesting way with methods to optimally split data between training and validation sets.
References
- Agrawal and Goyal  S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In COLT-2012, pp. 39.1–39.26, Edinburgh, Scotland, June 2012.
- Ali et al.  A. Ali, R. Caruana, and A. Kapoor. Active learning with model selection. In Proc. of AAAI-2014, 2014.
- Auer et al.  P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
- Baldi et al.  P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5, July 2014.
- Bergstra et al.  J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In NIPS, pp. 2546–2554, 2011.
- Bergstra and Bengio  J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.
- Bergstra et al.  J. Bergstra, D. Yamins, and D. D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In ICML-2013, 2013.
- Bertin-Mahieux et al.  T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere. The million song dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011.
- Biem et al.  A. Biem, M. A. Butrico, M. D. Feblowitz, T. Klinger, Y. Malitsky, K. Ng, A. Perer, C. Reddy, A. V. Riabov, H. Samulowitz, D. Sow, G. Tesauro, and D. Turaga. Towards cognitive automation of data science. In Proc. of AAAI-2015, Demonstrations Track, Austin, TX, 2015.
- Blackard and Dean  J. A. Blackard and D. J. Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24(3):131–151, 2000.
- Brochu et al.  E. Brochu, V. M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical Report UBC TR-2009-23, Department of Computer Science, University of British Columbia, 2009.
- Caruana and Niculescu-Mizil  R. Caruana and A. Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In ICML-2006, pp. 161–168, 2006.
- Domhan et al.  T. Domhan, J. T. Springenberg, and F. Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI-2015, 2015.
- Duarte and Hu  M. Duarte and Y. H. Hu. Vehicle classification in distributed sensor networks. In Journal of Parallel and Distributed Computing, 2004.
- Feurer et al.  M. Feurer, J. T. Springenberg, and F. Hutter. Initializing Bayesian hyperparameter optimization via meta-learning. In Proc. of AAAI-2015, 2015.
- Guerra et al.  S. B. Guerra, R. B. C. Prudencio, and T. B. Ludermir. Predicting the performance of learning algorithms using support vector machines as meta-regressors. In ICANN, 2008.
- Guyon et al.  I. Guyon, A. R. S. A. Alamdari, G. Dror, and J. M. Buhmann. Performance prediction challenge. In IJCNN-2006, pp. 1649–1656, Vancouver, BC, Canada, July 2006.
- Hall et al.  M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA Data Mining Software: An update. SIGKDD Explorations, 11(1), 2009.
- Hoffman et al.  M. D. Hoffman, B. Shahriari, and N. de Freitas. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In AISTATS, pp. 365–374, 2014.
- Hutter et al.  F. Hutter, L. Xu, H. H. Hoos, and K. Leyton-Brown. Algorithm runtime prediction: Methods & evaluation. Artif. Intell., 206:79–111, 2014.
- Kawala et al.  F. Kawala, A. Douzal-Chouakria, E. Gaussier, and E. Dimert. Prédictions d’activité dans les réseaux sociaux en ligne. In Conférence sur les Modèles et l’Analyse des Réseaux: Approches Mathématiques et Informatique (MARAMI), 2013.
- McCallum  A. McCallum. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.
- Munos  R. Munos. From bandits to Monte-Carlo Tree Search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, 7(1):1–130, 2014.
- Pedregosa et al.  F. Pedregosa et al. Scikit-learn: Machine learning in Python. JMLR, 12:2825–2830, 2011.
- Rice  J. Rice. The algorithm selection problem. Advances in Computers, 15:65–118, 1976.
- Schaul et al.  T. Schaul, J. Bayer, D. Wierstra, Y. Sun, M. Felder, F. Sehnke, T. Rückstieß, and J. Schmidhuber. PyBrain. JMLR, 2010.
- Scott  S. L. Scott. A modern Bayesian look at the multi-armed bandit. Appl. Stochastic Models Bus. Ind., 26:639–658, 2010.
- Snoek et al.  J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, 2012.
- Thompson  W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.
- Thornton et al.  C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proc. of KDD-2013, pp. 847–855, 2013.
Appendix A: Proof Details
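Lemma 1.
Let $\mathcal{C}, T_r, T_v, N, M, c_i$ and $f_i$ for $i \in [M]$ be as in Definition 3. Let $r > 1$ and $b \in \mathbb{N}^+$. If the projected upper bound functions $u_j$ and $u_{i^*}$ used by DAUB* are valid, then it allocates to $C_j$ fewer than $r n_j^*$ examples in each step and $\frac{r^2}{r-1} n_j^*$ examples in total.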
Proof of Lemma 1.
Suppose, for the sake of contradiction, that DAUB* allocates at least $r n_j^*$ examples to learner $C_j$ at some point in its execution. Since $r > 1$ and all allocation sizes are at most $N$, $n_j^* < N$. Further, since $n_j$, the number of examples allocated to $C_j$, is always incremented geometrically by a factor of at most $r$, at some previous point in the algorithm, we must have $n_j \in \{n_j^*, \ldots, r n_j^* - 1\}$. Since the projected upper bound function $u_j$ is non-increasing, at that previous point in the algorithm, $u_j(n_j) \le u_j(n_j^*) < f^*$ by the definition of $n_j^*$. On the other hand, since the projected upper bound function $u_{i^*}$ is valid, the projected upper bound for $C_{i^*}$ would always be at least $f^*$.
Therefore, the algorithm, when choosing which learner to allocate the next set of examples to, will, from this point onward, always prefer $C_{i^*}$ (and possibly another learner appearing to be even better) over $C_j$, implying that $n_j$ will never exceed its current value. This contradicts the assumption that DAUB* allocates at least $r n_j^*$ examples to $C_j$ at some point during its execution.
For the bound on the total number of examples allocated to $C_j$, let $k$ be the largest integer such that $b r^k < r n_j^*$. Since DAUB* allocates fewer than $r n_j^*$ examples to $C_j$ in any single step and the allocation sizes start at $b$ and increase by a factor of $r$ in each step, $C_j$ must have received at most $b + br + br^2 + \ldots + br^k$ examples in total. This is smaller than $\frac{r}{r-1} b r^k < \frac{r^2}{r-1} n_j^*$. ∎
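Lemma 2.
Let $\mathcal{C}, T_r, T_v, N, M, c_i$ and $f_i$ for $i \in [M]$ be as in Definition 3. Let $r > 1$, $b \in \mathbb{N}^+$, and $S$ be the training data allocation sequence produced by DAUB*. If DAUB* allocates at most $n$ training examples to a learner $C_i$ in each step, then $\mathrm{cost}(S_i) \le \frac{r}{r-1} \, c_i(n)$.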
Proof of Lemma 2.
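As in the proof of Lemma 1, let $k = \lfloor \log_r (n / b) \rfloor$ and observe that the data allocation subsequence $S_i$ for learner $C_i$ must have been $\big((i, b), (i, br), (i, br^2), \ldots, (i, br^k)\big)$ or a prefix of it. The corresponding training cost for $C_i$ is $\mathrm{cost}(S_i) \le \sum_{\ell=0}^{k} c_i(b r^\ell)$. By the assumption that $c_i$ grows at least linearly: $c_i(b r^\ell) \le \frac{b r^\ell}{n} \, c_i(n)$ for $\ell \le k$. It follows that:
$$\mathrm{cost}(S_i) \;\le\; \frac{c_i(n)}{n} \sum_{\ell=0}^{k} b r^\ell \;=\; \frac{c_i(n)}{n} \cdot \frac{b \, (r^{k+1} - 1)}{r - 1} \;<\; \frac{r}{r-1} \cdot \frac{b r^k}{n} \, c_i(n) \;\le\; \frac{r}{r-1} \, c_i(n).$$
This finishes the proof. ∎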
Lemma 3.
If $f_i(n)$ is well-behaved, then $u_i(n)$ is a valid projected upper bound function.
Proof of Lemma 3.
Recall that is the minimum of and . Since is a non-increasing function and is already argued to be an upper bound on , it suffices to show that is also a non-increasing function of and .
|because is non-increasing|
The first inequality follows from the assumptions on the behavior of w.r.t. and its first-order Taylor expansion. Specifically, recall that is defined as for some (implicit) parameter . For concreteness, let’s refer to that implicit function as . Let denote the discrete derivative of w.r.t. . The non-increasing nature of w.r.t. implies , for any fixed , is a non-decreasing function of . In particular, for any . It follows that , as desired.
Thus, is non-increasing. Since by definition, we have , finishing the proof. ∎
Lemma 4.
For an $(N, \Delta)$-suboptimal learner $C_j$ with well-behaved accuracy function $f_j$, $n_j^* \le n_j^\Delta$.
Proof of Lemma 4.
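If $n_j^\Delta = N$, then $n_j^* \le N = n_j^\Delta$ and the statement of the lemma holds trivially. Otherwise, by the definition of $n_j^\Delta$, $f_j'(n_j^\Delta) \le \Delta / N$. In order to prove $n_j^* \le n_j^\Delta$, we must show that $u_j(n_j^\Delta) < f^*$. We do this by using first-order Taylor expansion:
$$\begin{aligned}
u_j(n_j^\Delta) &\le f_j(n_j^\Delta) + (N - n_j^\Delta) \, f_j'(n_j^\Delta) \\
&\le f_j(n_j^\Delta) + (N - n_j^\Delta) \, \frac{\Delta}{N} && \text{by the above observation} \\
&< f_j(n_j^\Delta) + \Delta \\
&\le f_j(N) + \Delta && \text{since } n_j^\Delta \le N \text{ and } f_j \text{ is non-decreasing} \\
&\le f_{i^*}(N) && \text{by } (N, \Delta)\text{-suboptimality of } C_j
\end{aligned}$$
Hence, $u_j(n_j^\Delta) < f_{i^*}(N) = f^*$. ∎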
Proposition 1.
If $f(n)$ is a well-behaved function for $n \in \mathbb{N}$, then its discrete derivative $f'(n)$ decreases asymptotically as $o(1/n)$.
Proof of Proposition 1.
Recall the definition of the discrete derivative from Section 3 and assume for simplicity of exposition that the parameter $s$ is $1$, namely, $f'(n) = f(n) - f(n-1)$. The argument can be easily extended to $s > 1$. Applying the definition repeatedly, we get $f(n) = f(1) + (f'(2) + \ldots + f'(n))$. If $f'(n)$ was $\Omega(1/n)$, then there would exist $n_0$ and $c > 0$ such that for all $n \ge n_0$, $f'(n) \ge c/n$. This would mean $f(n) \ge f(1) + \sum_{m=n_0}^{n} c/m$. This summation, however, diverges to infinity while $f(n)$ is bounded above by $1$. It follows that $f'(n)$ could not have been $\Omega(1/n)$ to start with. It must thus decrease asymptotically strictly faster than $1/n$, that is, be $o(1/n)$. ∎
Lemma 5.
For an $(N, \Delta)$-suboptimal learner $C_j$ with a well-behaved accuracy function $f_j$ satisfying $f_j'(N) \le \Delta / N$, we have that $n_j^\Delta$ is $o(N)$ in $N$.
Proof of Lemma 5.
From Proposition 1, , implying . This means that a value that is and suffices to ensure for all large enough , that is, . Since is, by definition, no larger than , must also be . ∎
Proof of Theorem 3.
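Since DAUB* never allocates more than $N$ training examples in a single step to $C_i$ for any $i \in [M]$, it follows from Lemma 2 that $\mathrm{cost}(S_i) \le \frac{r}{r-1} \, c_i(N)$. In particular, $\mathrm{cost}(S_{\tilde{i}}) \le \frac{r}{r-1} \, c_{\tilde{i}}(N)$.
The $(N, \Delta)$-regret of DAUB*, by definition, is $\sum_{j \in J} \mathrm{cost}(S_j)$. By Theorem 2, this is at most $\frac{r}{r-1} \sum_{j \in J} c_j(r n_j^\Delta)$. Since the cost function is assumed to increase at least linearly, this quantity is at most $\frac{r}{r-1} \sum_{j \in J} \frac{r n_j^\Delta}{N} \, c_j(N)$. From Lemma 5, we have that $n_j^\Delta = o(N)$ and hence $\frac{r n_j^\Delta}{N} = o(1)$ in $N$. Plugging this in and dropping the constant factors from the asymptotics, we obtain that the regret is $o\big(\sum_{j \in J} c_j(N)\big)$.
Finally, if $c_j(n) = O(c_{\tilde{i}}(n))$, then $\sum_{j \in J} c_j(N) = O\big(\sum_{j \in J} c_{\tilde{i}}(N)\big)$, which is simply $O(M \cdot c_{\tilde{i}}(N))$. Since $\mathrm{cost}(S_{\tilde{i}}) \ge c_{\tilde{i}}(N)$, this quantity is also in $O(M \cdot \mathrm{cost}(S_{\tilde{i}}))$. It follows from the above result that the $(N, \Delta)$-regret of DAUB* is $o(M \cdot \mathrm{cost}(S_{\tilde{i}}))$ in $N$, as claimed. ∎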
Theorem 5 (Lower Bound, formal statement).
Let and be a training data allocation algorithm that, when executed on a training set of size , is guaranteed to always output an -optimal learner. Let and for be as in Definition 3. Let and be an -suboptimal learner. Then there exists a choice of such that is well-behaved, and allocates to more than examples, thus incurring a misallocated training cost on larger than .
Proof of Theorem 5.
We will argue that, under certain circumstances, must allocate at least examples to in order to guarantee