1. Introduction
The performance of machine learning algorithms crucially depends on their hyperparameters. Tuning hyperparameters is usually a tedious and expensive process. For this reason, there is a need for automated hyperparameter optimization (HPO) schemes that are sample-efficient and robust. Bayesian optimization (BO) is a popular approach to optimizing gradient-free functions, and has recently gained traction in HPO by obtaining state-of-the-art results in tuning many modern machine learning models (Chen et al., 2018; Snoek et al., 2012; Melis et al., 2018).
BO optimizes an expensive gradient-free function by iteratively evaluating it at carefully chosen locations: it builds and sequentially updates a probabilistic model of the function, uses an acquisition function to select the next location to evaluate, and repeats. Consider an example of optimizing a neural network: here, “locations” correspond to choosing a given architecture and hyperparameter configuration. The model is evaluated by optimizing the weights of the neural network on the training set (e.g., via SGD), and estimating its loss on a validation set. This estimated performance is returned to the HPO algorithm to guide the search. While convergence to optimal configurations can be shown for certain HPO algorithms (Srinivas et al., 2010; Wang and Jegelka, 2017), such analyses generally assume that the number of steps goes to infinity. In practice, however, BO is terminated after a finite number of iterations. After these iterations, BO outputs the best hyperparameter configuration based on the validation loss. One may notice several issues with this approach: (a) BO uses the validation metric to guide the search, and thus it may overfit to this metric, especially on small datasets; (b) incorrectly fixing the number of BO iterations in advance can lead either to suboptimal solutions or to a waste of computational resources.

Despite the wide usage of BO for HPO, to the best of our knowledge, its potential for overfitting has not been studied. As we show in Section 3, overfitting does indeed occur, and it exhibits different characteristics from classical overfitting in training machine learning algorithms. On the one hand, we cannot mitigate overfitting by directly adding a regularization term, due to the gradient-free nature of BO. On the other hand, classical early stopping in deep learning training (Prechelt, 1996; Raskutti et al., 2014; Li et al., 2020) cannot be directly applied due to the explorative and global nature of BO. Finally, while cross-validation is a common technique to detect and mitigate overfitting, it comes with a high computational cost, and it is unclear how to use it effectively with BO.

Although stopping criteria are critical for BO, only a few works (Nguyen et al., 2017; Lorenz et al., 2016) study automated termination. These methods rely on a pre-selected threshold for acquisition functions, which degrades performance when misspecified. In (McLeod et al., 2018), the authors combine local and Bayesian optimization by selecting from multiple acquisition functions at each iteration, defining an automatic stopping rule for the resulting algorithm. However, this approach still comes with a stopping tolerance hyperparameter, which significantly affects results and yet has to be set manually by the user.
Contributions.
In this work, we propose a termination rule for BO, building on cross-validation, that is problem-adaptive and easy to incorporate into standard BO frameworks. The intuition of our method is the following: stop BO when the maximum plausible improvement becomes less than the (post-corrected) standard deviation of the cross-validation metrics. In particular, our method relies on two components: (i) a high-probability bound on the gap of the validation metrics between the current best hyperparameter configuration and the optimal configuration (Ha et al., 2019) and (ii) a stopping threshold based on the variance of the cross-validation estimate of the generalization performance from (Nadeau and Bengio, 2003).

Our main contributions are as follows:

We present an empirical study of overfitting in BO which, to our knowledge, is the first to address this question.

We propose a simple yet powerful stopping criterion that is problem-adaptive and interpretable. The method exploits existing BO components and is thus easy to use in practice.

Our experimental results show that our method matches or outperforms other early stopping baselines in test accuracy, in some cases by a large margin, while maintaining a valuable speed-up in compute time.
We present our empirical study of overfitting in BO in Section 3, and then introduce our stopping criterion in Section 4. In Section 5 we experimentally evaluate it on real-world hyperparameter optimization problems and compare it with other baselines. We also discuss related work in Section 2 and the remaining challenges for BO early stopping in Section 6.
2. Related work
Overfitting and robustness in BO are relatively underexplored areas. Besides BO early stopping, another direction to robustify a solution is to consider distributional data shifts (Kirschner et al., 2020; Nguyen et al., 2020) or to incorporate aleatoric uncertainty (Nguyen et al., 2017). In (Kirschner et al., 2020; Nguyen et al., 2020), the objective is to optimize the expected loss under the worst adversarial data distribution rather than the commonly used uniform distribution. The approach is used for HPO in (Nguyen et al., 2020), where it also relies on cross-validation and makes performance more robust under data shift. However, it does not scale to higher-dimensional problems. The aleatoric uncertainty in (Nguyen et al., 2017) is used to measure the sensitivity of the solution under perturbations of the input.

Beyond BO, different stopping criteria have also been proposed in other areas such as active learning (Altschuler and Bloodgood, 2019; Ishibashi and Hino, 2020). In (Altschuler and Bloodgood, 2019), the authors predict a change in the objective function to decide when to stop. In (Ishibashi and Hino, 2020), the authors propose statistical tests to track the difference in expected generalization errors between two consecutive evaluations.

The term early stopping commonly refers to terminating the training loop of algorithms that are trained iteratively, such as neural networks optimized via SGD or XGBoost (Prechelt, 1996; Raskutti et al., 2014; Li et al., 2020). This iterative training is exploited for BO-based HPO in (Klein et al., 2017; Dai et al., 2019; Swersky et al., 2014) as a way to save resources and prevent overfitting. Their notion of early stopping is different from, and in a way complementary to, the method proposed in our paper. Hence, we refer to our proposal as BO early stopping.

3. Overfitting in HPO
We empirically assess overfitting in BO-based HPO and outline its characteristics. We consider tuning three common algorithms, i.e., Linear Model trained with SGD (LM; implemented with SGDClassifier (log loss) and SGDRegressor in scikit-learn), Random Forest (RF) and XGBoost (XGB), on 19 datasets from various sources, mostly from OpenML (Vanschoren et al., 2014). We set 200 hyperparameter evaluations as the budget for BO and repeat each experiment with 10 seeds. Each combination of an algorithm, dataset and seed is referred to as an experiment throughout the paper. The detailed hyperparameter search space for the algorithms, the properties of the datasets and data splits, as well as the BO specification are listed in Appendix A. We use the same settings for evaluating our proposed early stopping method in Section 5.

3.1. Observations
We now present our observations for BO-based HPO from an overfitting perspective by considering the test error. During BO, we maintain an incumbent, i.e., the hyperparameters with the best validation error found so far. While the validation error of the incumbent is non-increasing by definition, the test error corresponding to the incumbent may reveal a different picture. In the following, we use the BO results on one particular dataset to demonstrate interesting observations. Unless stated otherwise, the observations generalize to other settings.
3.1.1. Non-monotonicity of the test error
In Fig. 1, we plot the validation and test errors of the incumbent as we tune the LM, XGB and RF algorithms on the op1009952 dataset with and without cross-validation. While the validation error is indeed decreasing, the test error behaves non-monotonically. This behavior contrasts with the “textbook” setting, in which the test error has a single minimum between underfitting and overfitting. When tuning XGB and RF on the same data, less overfitting is observed, indicating that some algorithms are more robust to their hyperparameters than others.
Cross-validation is the de facto method to mitigate overfitting, and we indeed observe an overall improvement in the test errors when using cross-validation estimates in the HPO procedure. However, cross-validation does not solve the overfitting problem, as we show in Fig. 1. Even with cross-validation, the test errors can still increase (as the experiments for LM show).
3.1.2. Variance in BO experiments
For the op1009952 data, we compute the variances of the validation and test errors at every BO iteration across 100 replicates for LM, XGB and RF in Fig. 2. One can see again that the validation errors converge while the test errors increase on average for LM. The variance of the test error is much higher than that of the validation error when tuning on this dataset.
There are three sources of randomness in the BO experiments: (i) randomness in reshuffling the dataset and splitting the data into K folds (controlled by the cross-validation seed); (ii) randomness in the BO procedure, including the random initialization and the optimization of the acquisition function (controlled by the BO seed); (iii) randomness in the model training, e.g., from stochastic gradient descent or model parameter initialization (controlled by the training seed).
We study the impact of the randomness inherited from these three sources in Fig. 3 by designing the following experiments. To estimate the variance from cross-validation splits, we fix the BO seed and the algorithm training seed, only allow the dataset to be reshuffled, and repeat the BO experiment 10 times. This gives one estimate of the variance from cross-validation for every BO iteration. To make the estimate more reliable, we then repeat this experiment for 10 different configurations of BO seed and algorithm training seed (as an outer loop) to compute 10 estimates of the variance from cross-validation. Finally, we report the mean of these estimates in Fig. 3 for every BO iteration. Similarly, we get 10 estimates of the variance from BO (fixing the cross-validation seed and algorithm training seed) and from algorithm training (fixing the cross-validation seed and BO seed), and also report the mean of the variances from these two sources in Fig. 3.
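The nested seed design described above can be sketched as follows. This is a minimal illustration rather than the experimental code behind Fig. 3: `run_bo` is a hypothetical stand-in that returns a toy incumbent-error trace for a given triple of seeds, and all function and argument names are assumptions made for the example.

```python
import random
import statistics


def run_bo(cv_seed, bo_seed, train_seed, n_iters=5):
    """Hypothetical stand-in for one BO run: incumbent error per iteration."""
    rng = random.Random(1000003 * cv_seed + 1009 * bo_seed + train_seed)
    best, trace = 1.0, []
    for _ in range(n_iters):
        best = min(best, rng.uniform(0.1, 1.0))  # incumbent never gets worse
        trace.append(best)
    return trace


def mean_variance_over_source(source, n_inner=10, n_outer=10, n_iters=5):
    """Mean, over outer seed configurations, of the per-iteration variance
    obtained when only the seed named by `source` is varied."""
    per_outer = []
    for outer in range(n_outer):
        # The two seeds not under study stay fixed within one outer configuration.
        seeds = {"cv_seed": outer, "bo_seed": outer, "train_seed": outer}
        traces = []
        for inner in range(n_inner):
            seeds[source] = 100 + inner  # vary only the source under study
            traces.append(run_bo(n_iters=n_iters, **seeds))
        # Sample variance across the inner repetitions, one value per iteration.
        per_outer.append([statistics.variance(col) for col in zip(*traces)])
    # Average the variance estimates over the outer loop.
    return [statistics.mean(col) for col in zip(*per_outer)]


cv_variance = mean_variance_over_source("cv_seed")
```

Calling the function with `"bo_seed"` or `"train_seed"` gives the analogous estimates for the other two sources of randomness.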
Several observations can be made from Fig. 3. First, the variance from BO tends to decrease in both validation and test errors as BO proceeds, and it is the largest source of variance for tuning XGB and RF. Second, the variance from algorithm training is the highest for LM and the lowest for XGB and RF. Third, the variance from cross-validation data splits is usually on a similar scale to the variance from algorithm training, at least for XGB and RF.
3.1.3. Why does overfitting happen?
As we have seen when tuning LM on the op1009952 dataset, the test errors behave drastically differently from the validation errors, while for XGB and RF less overfitting occurs. We conjecture that this is because the correlation between the validation and test errors of the hyperparameter configurations is weak. We illustrate this correlation in Fig. 4, where we plot the test and validation errors for all hyperparameters observed in the experiments.
From Fig. 4, we indeed observe a weaker correlation between the test and validation errors for LM. In practice, this correlation can be weak due to the small size of datasets or to data shifts. However, we do not have access to the test set during BO, thus we do not know beforehand how good the correlation is. Fortunately, when using cross-validation, the reliability of the validation metrics can be estimated, and it serves as a key component of our stopping criterion.
In conclusion, we have shown that overfitting can indeed happen in BO-based HPO, with perhaps unusual characteristics compared to “classical” overfitting. Running BO longer does not necessarily lead to better generalization performance, thus some form of early stopping for BO may benefit the solution quality and at the same time reduce the computational cost. The variance of tuning the same algorithm on the same dataset can be large, and it differs across algorithms and datasets. As a result, the early stopping method needs to be adaptive and robust to diverse scenarios.
4. Regret-based Stopping
In this section, we review the basics of Bayesian optimization in Section 4.1, and then propose our novel regret-based stopping criterion for BO in Section 4.2, which employs cross-validation.
4.1. Bayesian Optimization
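Assume we have a learning algorithm $\mathcal{M}$ defined by its hyperparameters $\gamma \in \Gamma$ and parametrised by a weight (parameter) vector $w$: $\mathcal{M}_\gamma(\cdot\,; w)$. Let $D = \{(x_i, z_i)\}_{i=1}^{n}$ be the collected dataset of pairs drawn from an unknown data distribution $P$. The goal of HPO is then to find the best hyperparameters $\gamma$ optimizing the expected loss $\mathbb{E}_{(x,z) \sim P}\left[\ell\left(z, \mathcal{M}_\gamma(x; w)\right)\right]$. In practice, the data distribution is unknown, and an empirical estimate is used instead. The available data is split into $D_{tr}$ and $D_{val}$, used for training and validation. One can also use cross-validation and report the average loss across different validation folds. Formally, the bi-level optimization problem over hyperparameters and weights is as follows:

$$\gamma^* = \arg\min_{\gamma \in \Gamma} f(\gamma), \qquad f(\gamma) = \frac{1}{|D_{val}|} \sum_{(x,z) \in D_{val}} \ell\left(z, \mathcal{M}_\gamma(x; w^*(\gamma))\right), \qquad \text{s.t.} \quad w^*(\gamma) = \arg\min_{w} \frac{1}{|D_{tr}|} \sum_{(x,z) \in D_{tr}} \ell\left(z, \mathcal{M}_\gamma(x; w)\right).$$

BO is an iterative gradient-free optimization method which, at every step $t$, selects an input $\gamma_t \in \Gamma$ and observes a noise-perturbed output $y_t = f(\gamma_t) + \epsilon_t$, where $\epsilon_t$ is typically assumed to be i.i.d. (sub-)Gaussian noise with variance (proxy) $\sigma^2$. BO algorithms aim to find the global minimizer $\gamma^* = \arg\min_{\gamma \in \Gamma} f(\gamma)$ by leveraging two components: (i) a probabilistic function model, used to approximate the gradient-free function $f$, and (ii) an acquisition function which determines the next query. A popular choice for the probabilistic model (or surrogate) is a Gaussian process (GP) (Rasmussen and Williams, 2006), specified by a mean function $\mu_0 : \Gamma \to \mathbb{R}$ and a kernel $\kappa : \Gamma \times \Gamma \to \mathbb{R}$. We assume the objective $f$ is sampled from a GP prior, i.e., $f \sim \mathcal{GP}(\mu_0, \kappa)$; thus, for all $\gamma \in \Gamma$ the values $f(\gamma)$ are normally distributed, i.e., $f(\gamma) \sim \mathcal{N}(\mu_0(\gamma), \kappa(\gamma, \gamma))$. After collecting $t$ data points $D_t = \{(\gamma_i, y_i)\}_{i=1}^{t}$, the GP posterior about the value $f(\gamma)$ at a new point $\gamma$ is defined by the posterior mean $\mu_t(\gamma)$ and posterior variance $\sigma_t^2(\gamma)$ as:

$$\mu_t(\gamma) = \mu_0(\gamma) + \kappa_t(\gamma)^\top \left(K_t + \sigma^2 I\right)^{-1} \left(\mathbf{y}_t - \mathbf{m}_t\right), \tag{1}$$

$$\sigma_t^2(\gamma) = \kappa(\gamma, \gamma) - \kappa_t(\gamma)^\top \left(K_t + \sigma^2 I\right)^{-1} \kappa_t(\gamma), \tag{2}$$

where $K_t = [\kappa(\gamma_i, \gamma_j)]_{i,j=1}^{t}$, $\kappa_t(\gamma) = [\kappa(\gamma_1, \gamma), \dots, \kappa(\gamma_t, \gamma)]^\top$, $\mathbf{y}_t = [y_1, \dots, y_t]^\top$ and $\mathbf{m}_t = [\mu_0(\gamma_1), \dots, \mu_0(\gamma_t)]^\top$.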
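As a rough illustration of the GP posterior used as the surrogate, the sketch below computes a posterior mean and variance in plain Python. It assumes a squared-exponential kernel with unit lengthscale and a constant prior mean; these choices and all function names are assumptions for the example, not the BO specification of Appendix A.

```python
import math


def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel between two points given as tuples."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * sq / lengthscale ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def gp_posterior(X, y, x_new, noise_var=1e-6, prior_mean=0.0):
    """Posterior mean and variance at x_new for a GP with constant prior mean."""
    t = len(X)
    # Kernel matrix of the evaluated points with the noise variance on the diagonal.
    K = [[rbf(X[i], X[j]) + (noise_var if i == j else 0.0) for j in range(t)]
         for i in range(t)]
    k_new = [rbf(xi, x_new) for xi in X]
    alpha = solve(K, [yi - prior_mean for yi in y])  # (K + s^2 I)^{-1} (y - m)
    v = solve(K, k_new)                              # (K + s^2 I)^{-1} k(x_new)
    mean = prior_mean + sum(k * a for k, a in zip(k_new, alpha))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_new, v))
    return mean, max(var, 0.0)  # clip tiny negative values caused by round-off
```

With nearly noise-free observations, the posterior mean at an evaluated point stays close to the observed value and the variance is close to zero, while far from all observations the posterior reverts to the prior.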
Given a fitted probabilistic model, BO uses an acquisition function to balance the exploration-exploitation trade-off when suggesting the next hyperparameters. Common choices are probability of improvement (PI) (Kushner, 1963), expected improvement (EI) (Mockus et al., 1978), entropy search (ES) (Hennig and Schuler, 2012), predictive entropy search (PES) (Hernández-Lobato et al., 2014) as well as maximum value entropy search (MES) (Wang and Jegelka, 2017). We focus on the expected improvement throughout our paper for its simplicity and wide adoption, but our approach is general. Let $\gamma_t^*$ denote the hyperparameters with the minimum loss so far, and let $\mu_t$ and $\sigma_t$ be the GP posterior mean and standard deviation; the EI for a hyperparameter $\gamma$ can then be defined as:

$$\mathrm{EI}(\gamma) = \mathbb{E}\left[\max\left(0, f(\gamma_t^*) - f(\gamma)\right)\right] = \sigma_t(\gamma) \left(z\, \Phi(z) + \phi(z)\right),$$

where $z = \left(f(\gamma_t^*) - \mu_t(\gamma)\right) / \sigma_t(\gamma)$, and $\Phi$ and $\phi$ denote the CDF and PDF of the standard normal, respectively. In case of noisy observations, the unknown value $f(\gamma_t^*)$ is replaced by the corresponding GP mean estimate (Picheny et al., 2013). A thorough review of BO can be found in (Shahriari et al., 2016).
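The closed form of EI for a minimization problem can be written in a few lines. The sketch below is illustrative only: the function names are assumptions, and `best` stands for the incumbent value (or its GP mean estimate in the noisy case).

```python
import math


def normal_pdf(z):
    """PDF of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best):
    """EI for minimization: E[max(0, best - f)] with f ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        # No posterior uncertainty: the improvement is deterministic.
        return max(0.0, best - mu)
    z = (best - mu) / sigma
    return sigma * (z * normal_cdf(z) + normal_pdf(z))
```

EI is large either where the posterior mean is below the incumbent (exploitation) or where the posterior standard deviation is large (exploration).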
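Convergence of BO can be quantified by the simple regret:

$$r_t = f(\gamma_t^*) - f(\gamma^*),$$

where $\gamma_t^*$ is the incumbent at iteration $t$ and $\gamma^*$ are the optimal hyperparameters. It defines the suboptimality in function value. However, the optimum $f(\gamma^*)$ is rarely known in advance, thus $r_t$ cannot be computed in practice.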
4.2. Stopping criterion for BO
In the following, we propose a stopping criterion for BO which relies on two building blocks: an upper bound on the simple regret and an adaptive threshold that is based on the sample variance obtained via cross-validation.
4.2.1. Upper bound for simple regret
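Even though the optimum $f(\gamma^*)$ is unknown, it is possible to estimate an upper bound for the simple regret based on our GP surrogate, as shown in (Ha et al., 2019). Specifically, we can upper bound the best value found so far by

$$f(\gamma_t^*) \le \min_{\gamma \in G_t} \mathrm{ucb}_t(\gamma), \qquad \mathrm{ucb}_t(\gamma) = \mu_t(\gamma) + \sqrt{\beta_t}\, \sigma_t(\gamma), \tag{3}$$

where $G_t = \{\gamma_1, \dots, \gamma_t\}$ are the hyperparameters evaluated so far, $\mu_t$ and $\sigma_t$ are the GP posterior mean and standard deviation, and $\beta_t$ are appropriate constants for the confidence bound to hold, studied in (Srinivas et al., 2010). Specifically, we used Theorem 1 in (Srinivas et al., 2010) to compute $\beta_t$, with the modification of using the number of hyperparameters as the size of the input domain to accommodate continuous hyperparameters.
Similarly, we can lower bound the true unknown optimum as:

$$f(\gamma^*) \ge \min_{\gamma \in \Gamma} \mathrm{lcb}_t(\gamma), \qquad \mathrm{lcb}_t(\gamma) = \mu_t(\gamma) - \sqrt{\beta_t}\, \sigma_t(\gamma). \tag{4}$$

Combining Eqs. 3 and 4 gives an upper bound $\bar{r}_t$ on the simple regret $r_t = f(\gamma_t^*) - f(\gamma^*)$:

$$r_t \le \min_{\gamma \in G_t} \mathrm{ucb}_t(\gamma) - \min_{\gamma \in \Gamma} \mathrm{lcb}_t(\gamma) =: \bar{r}_t. \tag{5}$$

This upper bound is used for BO with unknown search space in (Ha et al., 2019) to decide when to expand the search space. Loosely speaking, they show that, with high probability, this bound shrinks to a very small value after enough BO iterations under certain conditions. For more details on the theoretical aspects, we refer readers to Theorem 5.1 in (Ha et al., 2019).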
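The regret bound can be computed directly from posterior means and standard deviations. The following sketch assumes these summaries are already available as lists of `(mean, std)` pairs, with the lower confidence bound minimized over a finite candidate set standing in for the search space. The helper `beta_t` follows the form of Theorem 1 in (Srinivas et al., 2010) for a finite domain, $\beta_t = 2 \log(|D| t^2 \pi^2 / (6\delta))$; the default `delta` and the function names are assumptions for illustration.

```python
import math


def beta_t(t, domain_size, delta=0.1, scale=5.0):
    """Confidence parameter in the style of Theorem 1 of Srinivas et al. (2010),
    divided by `scale` (the experiments scale it down by a factor of 5)."""
    return 2.0 * math.log(domain_size * t ** 2 * math.pi ** 2 / (6.0 * delta)) / scale


def regret_upper_bound(evaluated, candidates, beta):
    """Upper bound on the simple regret from GP posterior summaries.

    evaluated:  (mean, std) pairs at the hyperparameters evaluated so far
    candidates: (mean, std) pairs over the search space (e.g. a dense grid)
    """
    root = math.sqrt(beta)
    # Smallest upper confidence bound among evaluated points: bounds the
    # incumbent's true value from above.
    best_ucb = min(m + root * s for m, s in evaluated)
    # Smallest lower confidence bound over the search space: bounds the
    # unknown optimum from below.
    best_lcb = min(m - root * s for m, s in candidates)
    return best_ucb - best_lcb


evaluated = [(0.30, 0.01), (0.35, 0.02)]
candidates = evaluated + [(0.32, 0.10)]
bound = regret_upper_bound(evaluated, candidates, beta=4.0)
```

In this toy example the bound is dominated by the unexplored candidate with a large posterior standard deviation, which is why it shrinks only as the surrogate becomes confident across the whole search space.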
4.2.2. Stopping threshold
For small datasets, it is common to use cross-validation to prevent overfitting. Formally, for $K$-fold cross-validation, the train-validation dataset is split into $K$ smaller sets, and then $K$ pairs $(D_{tr}^{k}, D_{val}^{k})$, $k = 1, \dots, K$, are constructed by iterating over these sets. At each BO iteration, the average loss across the different validation splits is then reported. The details can be found in Algorithm 1.
Given the validation metrics from the different splits, one can compute not only their mean but also their variance. Let $s_{cv}^2$ denote this sample variance. We are interested in the variance of the cross-validation estimate of the generalization performance. A simple post-correction technique to estimate it is proposed by (Nadeau and Bengio, 2003) and is as follows:

$$\widehat{\mathrm{Var}}\left[f(\gamma)\right] \approx \left(\frac{1}{K} + \frac{|D_{val}|}{|D_{tr}|}\right) s_{cv}^2, \tag{6}$$

where $|D_{tr}|$ and $|D_{val}|$ are the sizes of the training and the validation sets in $K$-fold cross-validation. We use 10-fold cross-validation in our experiments, thus the post-correction constant on the variance is $1/10 + 1/9 \approx 0.21$. We also empirically validate in Section A.2 that with larger $K$, the variance of cross-validation metrics indeed tends to be higher.
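The correction of (Nadeau and Bengio, 2003) multiplies the sample variance across folds by $1/K + n_{val}/n_{tr}$. A minimal sketch, with hypothetical fold losses and an illustrative function name:

```python
import statistics


def corrected_cv_variance(fold_losses, n_train, n_val):
    """Corrected variance of the cross-validation estimate, in the style of
    Nadeau and Bengio (2003): (1/K + n_val/n_train) * sample variance."""
    k = len(fold_losses)
    s2 = statistics.variance(fold_losses)  # sample variance across the K folds
    return (1.0 / k + n_val / n_train) * s2


# Hypothetical validation losses of one configuration under 10-fold CV,
# where each fold trains on 900 points and validates on 100.
folds = [0.20, 0.22, 0.18, 0.21, 0.19, 0.23, 0.17, 0.20, 0.21, 0.19]
var_hat = corrected_cv_variance(folds, n_train=900, n_val=100)
```

For 10 folds the multiplier is $1/10 + 1/9 \approx 0.21$, roughly twice the naive $1/K$ factor, which accounts for the overlap between training sets of different folds.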
In BO, we have the sample variance $s_{cv}^2(\gamma_t)$ for every evaluated $\gamma_t$, and for the stopping threshold we need to decide between using an average estimate of $s_{cv}^2$ and a specific $s_{cv}^2(\gamma)$ for some $\gamma$. To answer this question, we conducted an ablation study on the correlation between the sample variance in cross-validation and its mean performance in Section A.3. We found that the sample variance in cross-validation does depend on the hyperparameter configuration, thus we propose to use only the variance of the incumbent $\gamma_t^*$.
Now we are ready to introduce our stopping criterion. Given $\bar{r}_t$ as the upper bound of the distance to the optimal function value at iteration $t$ and $\sqrt{\widehat{\mathrm{Var}}[f(\gamma_t^*)]}$ as the standard deviation of the generalization error estimate for the current incumbent $\gamma_t^*$, we terminate BO if the following condition is met:

$$\bar{r}_t < \sqrt{\widehat{\mathrm{Var}}\left[f(\gamma_t^*)\right]}. \tag{7}$$

The stopping condition has the following interpretation: once the maximum plausible improvement becomes less than the standard deviation of the generalization error estimate, further evaluations will not reliably lead to an improvement in the generalization error. The variance-based threshold is problem-specific and adapts to a particular algorithm and data. The pseudocode of our method can be found in Algorithm 1.
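Putting the two building blocks together, the termination check reduces to a single comparison. The sketch below is an illustration under stated assumptions and not a reproduction of Algorithm 1: the regret bound is assumed to be computed elsewhere, the fold losses are hypothetical, and the `min_iterations` warm-up mirrors the 20 initial iterations used in the experiments.

```python
import math
import statistics


def should_stop(regret_bound, incumbent_fold_losses, n_train, n_val,
                iteration, min_iterations=20):
    """Return True once the regret bound falls below the corrected standard
    deviation of the incumbent's cross-validation estimate."""
    if iteration <= min_iterations:
        return False  # let the surrogate model stabilise first
    k = len(incumbent_fold_losses)
    s2 = statistics.variance(incumbent_fold_losses)
    threshold = math.sqrt((1.0 / k + n_val / n_train) * s2)
    return regret_bound < threshold


# Hypothetical fold losses of the incumbent under 10-fold cross-validation.
folds = [0.20, 0.22, 0.18, 0.21, 0.19, 0.23, 0.17, 0.20, 0.21, 0.19]
stop_now = should_stop(0.005, folds, n_train=900, n_val=100, iteration=50)
```

Because the threshold is computed from the incumbent's own fold losses, a noisy validation estimate stops the search earlier than a precise one, with no threshold for the user to tune.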
5. Experiments
We study how the speed-up gained from early stopping affects the final test performance. To this end, we first compare our method to the setting with the default number of iterations, and then evaluate existing stopping criteria, such as (Nguyen et al., 2017; Lorenz et al., 2016). We present experimental results on tuning 3 common models, Linear Model trained with SGD (LM), Random Forest (RF) and XGBoost (XGB), on 19 small datasets (fewer than 10k instances) with 10-fold cross-validation.
Experimental setup. In BO, we optimize the classification error or the root mean squared error computed by cross-validation. These errors are positive by definition, and we incorporate this prior knowledge by modelling the log transformation of these errors and adapting the variance accordingly. We use the number of hyperparameter evaluations as the budget for BO. We report the test performance computed on the fixed test split. We refer the reader to Appendix A for the BO settings (Section A.1.1), the detailed hyperparameter search space of the algorithms (Section A.1.2), as well as the characteristics of the datasets and their splits (Section A.1.3). We apply early stopping only after the first 20 iterations, to ensure a robust fit of the surrogate models both for our method and for the baselines. The only hyperparameter involved in our method is $\beta_t$, which is set such that the confidence bounds in Eqs. 3 and 4 hold with high probability. We use Theorem 1 in (Srinivas et al., 2010) to set $\beta_t$ and further scale it down by a factor of 5, as done in the experiments in (Srinivas et al., 2010); it is then fixed for all the experiments.
Metrics. To measure the effectiveness of a termination criterion, we analyze two metrics, quantifying the change in test error as well as the time saved. In particular, given the BO budget $T$, we compare the test error when early stopping is triggered, $y_{es}$, to the test error $y_T$ obtained with the full budget. For each experiment, we compute the relative test error change, i.e., RYC (we use $y$ to denote the test error), as:

$$\mathrm{RYC} = \frac{y_T - y_{es}}{\max(y_T, y_{es})}. \tag{8}$$

RYC allows aggregating the results over different algorithms and datasets, as $\mathrm{RYC} \in [-1, 1]$, and can be interpreted as follows: a positive RYC represents an improvement in the test error when applying early stopping, while a negative RYC indicates the opposite.
Similarly, let the total training time for a predefined budget $T$ be $t_T$ and the total training time when early stopping is triggered be $t_{es}$. Then the relative time change, i.e., RTC, is defined as:

$$\mathrm{RTC} = \frac{t_T - t_{es}}{t_T}. \tag{9}$$

A positive RTC, where $\mathrm{RTC} \in [0, 1]$, indicates a reduction in total training time.
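Both metrics are simple ratios. The sketch below assumes that the relative error change is normalized by the larger of the two errors (which keeps it bounded for non-negative errors) and that the time change is normalized by the full-budget time; function and argument names are illustrative.

```python
def relative_test_error_change(y_full, y_es):
    """RYC: positive when early stopping gives a lower test error."""
    denom = max(y_full, y_es)
    if denom == 0:
        return 0.0  # both errors are zero, so there is no change
    return (y_full - y_es) / denom


def relative_time_change(t_full, t_es):
    """RTC: fraction of the full-budget training time that was saved."""
    return (t_full - t_es) / t_full


# Hypothetical experiment: stopping early halves the test error and
# uses a quarter of the full training time.
ryc = relative_test_error_change(y_full=0.20, y_es=0.10)
rtc = relative_time_change(t_full=200.0, t_es=50.0)
```

Normalizing by the larger error makes the score symmetric: halving the error and doubling it give RYC values of the same magnitude and opposite sign.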
5.1. Comparing to default budget
We first study our stopping criterion for all datasets and algorithms under the predefined BO budget and visualize the corresponding RYC and RTC scores in Fig. 5. Each dot in Fig. 5 represents an experiment, sorted along the axis by RTC score. One can see that in the experiments where our early stopping is triggered, many RYC scores are non-negative, showing that our method was able to either improve or match the default test error. However, there are a few cases where our method leads to worse test errors and thus negative RYC scores.
We further demonstrate the effectiveness of our early stopping criterion under different BO budgets $T$ and show how much we can improve over the default setting. We present the resulting distributions of RTC and RYC scores in Fig. 6 with violin plots. We choose violin plots over box plots because the boxes are in many cases not visible due to the clustered scores, while the violin plot clearly reveals the density of the values.
From Fig. 6, it can be seen that our method is effective under all budgets: stopping does not harm the solution on average, as RYC scores are concentrated around 0, while the speed-up is noticeable, especially for large budgets.
5.2. Comparing to naïve convergence test
We compare our method with a naïve convergence test controlled by a parameter $i$: BO is stopped once the best observed validation metric remains unchanged for $i$ consecutive iterations. This method mimics early stopping during algorithm training with two notable differences. First, we only track the validation metrics of the incumbent instead of the hyperparameters suggested at every iteration, because the latter may underperform due to the exploratory nature of BO. Second, defining a threshold is not necessary, as the incumbent may stay the same for many iterations and then suddenly change, as shown in Fig. 1.
This convergence condition relies heavily on $i$, which is chosen in advance. However, the optimal $i$ differs across experiments. We consider values of $i$ commonly used in practice under a fixed BO budget. The results for the RYC and RTC distributions are presented in Fig. 7.
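The naïve convergence test amounts to a patience counter on the incumbent's validation error. A minimal sketch, where the trace of incumbent errors is hypothetical and the argument name `patience` plays the role of the parameter controlling the test:

```python
def convergence_stop_iteration(incumbent_errors, patience):
    """Return the 1-based iteration at which BO would stop, or None if the
    incumbent never stays unchanged for `patience` consecutive iterations."""
    unchanged = 0
    for t in range(1, len(incumbent_errors)):
        if incumbent_errors[t] < incumbent_errors[t - 1]:
            unchanged = 0       # the incumbent improved: reset the counter
        else:
            unchanged += 1      # no improvement at this iteration
        if unchanged >= patience:
            return t + 1
    return None


# Hypothetical incumbent validation errors over eight BO iterations.
trace = [0.5, 0.4, 0.4, 0.4, 0.3, 0.3, 0.3, 0.3]
stop_small = convergence_stop_iteration(trace, patience=2)
stop_large = convergence_stop_iteration(trace, patience=3)
```

In this trace a patience of 2 stops at iteration 4 and misses the later improvement to 0.3, whereas a patience of 3 stops only at iteration 8, illustrating how sensitive the outcome is to the chosen value.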
The general trend illustrated in Fig. 7 is as follows: as the number of consecutive iterations $i$ required by the convergence test increases, the speed-up decreases, i.e., the average RTC drops. However, the solution quality increases as well, and one can see a significant gain in the mean RYC score, except for LM. One can notice clear differences between this convergence baseline and our method: our adaptive stopping condition results not only in the best average RYC score, but also in the smallest variance, which shows that it delivers a more robust solution. Moreover, it sometimes outperforms the baselines by a large margin in RYC, e.g., for XGBoost and, for random forest, by a factor of 3.6 while being 1.3 times slower than the corresponding convergence check.
At this point, we want to highlight that having a competitive RYC score is a much more challenging task than just gaining speedup. If one aims at maintaining the solution quality while stopping BO earlier, one needs to take into account the probabilistic model and BO process in a more comprehensive manner.
5.3. Comparing to other stopping criteria
Finally, we study two existing conditions for terminating BO, both relying on a predefined threshold that has to be tuned. The first one terminates BO once the value of the Expected Improvement (EI) acquisition function drops below the threshold (Nguyen et al., 2017). The second one uses a mixed approach and defines the termination threshold for the Probability of Improvement (PI) over the incumbent while still using EI as the acquisition function (Lorenz et al., 2016). By relying on EI and PI, these stopping criteria inherit their exploration-exploitation trade-off. However, these approaches are not problem-adaptive: they rely on a fixed threshold and do not take into account the variance obtained from cross-validation.
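A fixed-threshold rule of this kind can be sketched as below. This is only one plausible reading, assuming the rule fires when the largest acquisition value over the candidates falls below the threshold; the exact rules in (Nguyen et al., 2017; Lorenz et al., 2016) may differ, and the threshold value used here is a placeholder rather than a value from the experiments.

```python
import math


def probability_of_improvement(mu, sigma, best):
    """PI for minimization: P(f < best) with f ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        return 1.0 if mu < best else 0.0
    z = (best - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fixed_threshold_stop(acquisition_values, threshold):
    """Stop once even the most promising candidate scores below the threshold."""
    return max(acquisition_values) < threshold


# Hypothetical posterior summaries (mean, std) for three candidates and an
# incumbent value of 0.30; the threshold 0.01 is a placeholder.
candidates = [(0.40, 0.02), (0.45, 0.03), (0.50, 0.01)]
pi_values = [probability_of_improvement(m, s, best=0.30) for m, s in candidates]
stop = fixed_threshold_stop(pi_values, threshold=0.01)
```

Unlike a variance-based threshold, the constant passed as `threshold` carries no information about how noisy the validation metric is, so the same value can be too aggressive on one dataset and too conservative on another.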
We set the BO budget to 200, and avoid termination during the first 20 iterations. We follow the recommendations from (Nguyen et al., 2017; Lorenz et al., 2016) and first consider several values for each of the EI-based and PI-based stopping thresholds. Empirically, we observe that lower thresholds lead to a worse RYC-RTC trade-off: they decrease the average RTC score by around 5% while increasing the average RYC score by only around 0.5%. This highlights the challenge of setting the threshold properly for each experiment. As a result, we report the results for only one threshold for EI-based stopping and one for PI-based stopping. Fig. 8 illustrates the corresponding distribution of RTC and RYC scores for our method and these two baselines.
The EI and PI based stopping criteria behave similarly in terms of both RTC and RYC scores. The methods tend to stop BO much earlier than our method, thus leading to significant speed up as shown in the left of Fig. 8. However, and not surprisingly, such aggressive early stopping leads to worse test performance on average, as shown in the right of Fig. 8. Moreover, the variance of the test performance for the baselines is larger, which is in contrast to the robustness provided by our method.
One can observe that XGBoost is not stopped early as frequently as the other two algorithms. We suspect that this is because XGBoost has 9 tuning hyperparameters (the others have 3), and it is commonly known that GPs work well in low-dimensional settings. To validate this, we repeat the XGBoost tuning experiments with only three hyperparameters (n_estimators, max_depth and learning_rate), and we denote this new tuning task as XGB (small). We then compare our early stopping results when tuning XGBoost with these two search spaces in Fig. 9. Indeed, when tuning XGBoost with only 3 hyperparameters (thus easier for the GP to model), the average RTC score is improved by 50%. However, compared to the speed-up in tuning LM and RF, it is still relatively low.
6. Conclusions
This work investigated the problem of overfitting in BO, focusing on the context of tuning the hyperparameters of machine learning models. We proposed a novel stopping criterion based on two theoretically inspired quantities: an upper bound on the suboptimality of the incumbent, and a cross-validation estimate of the variance of the generalization performance. These ingredients make the proposed approach problem-adaptive, resulting in a method that is very simple to implement, comes with no extra hyperparameters, and is agnostic to the specific BO method. In extensive experiments, we demonstrated that our method adapts successfully to the tuning task at hand. We found that our proposal is robust and consistently finds solutions that have lower variance than baselines.
This paper opens several avenues for future work. First, while our method tends to improve the test error by a factor of 5 to 10 compared to baselines, it can be slower on average. Future work could reduce the computational cost by making the stopping strategy less conservative. Second, the variance estimate used in Eq. 7 relies on cross-validation, which can be computationally expensive. As the upper bound on the regret in Eq. 5 has a clear interpretation, a promising alternative is to let users specify the threshold in Eq. 7 directly, even without cross-validation.
References
 Stopping active learning based on predicted change of f measure for text classification. 2019 IEEE 13th International Conference on Semantic Computing (ICSC). External Links: ISBN 9781538667835, Document Cited by: §2.
 Bayesian optimization in alphago. External Links: 1812.06855 Cited by: §1.
 Bayesian optimization meets bayesian optimal stopping. In ICML, Cited by: §2.
 Bayesian optimization with unknown search space. In Advances in Neural Information Processing Systems 32 (NIPS), pp. 11795–11804. Cited by: §1, §4.2.1, §4.2.1.
 Entropy search for informationefficient global optimization. Journal of Machine Learning Research 98888 (1), pp. 1809–1837. Cited by: §4.1.
 Predictive entropy search for efficient global optimization of blackbox functions. In Advances in neural information processing systems (NeurIPS), pp. 918–926. Cited by: §4.1.
 Stopping criterion for active learning based on deterministic generalization bounds. External Links: 2005.07402 Cited by: §2.
 Distributionally robust bayesian optimization. Cited by: §2.
 Learning curve prediction with bayesian neural networks. In ICLR, Cited by: §2.
 A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. In Joint Automatic Control Conference, pp. 69–79. Cited by: §4.1.

Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks.
In
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics
, S. Chiappa and R. Calandra (Eds.), Proceedings of Machine Learning Research, Vol. 108, pp. 4313–4324. Cited by: §1, §2.  Stopping criteria for boosting automatic experimental design using realtime fmri with bayesian optimization. External Links: 1511.07827 Cited by: §A.1.1, §1, §5.3, §5.3, §5.
 Optimization, fast and slow: optimally switching between local and Bayesian optimization. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 3443–3452. Cited by: §1.

On the state of the art of evaluation in neural language models
. In International Conference on Learning Representations, External Links: Link Cited by: §1.  The application of Bayesian methods for seeking the extremum. Towards Global Optimization 2 (117129), pp. 2. Cited by: §4.1.
 Inference for the generalization error. Machine learning 52 (3), pp. 239–281. Cited by: §1, §4.2.2.
 Stable Bayesian optimization. In Advances in Knowledge Discovery and Data Mining, J. Kim, K. Shim, L. Cao, J. Lee, X. Lin, and Y. Moon (Eds.), Cited by: §2.
 Distributionally robust Bayesian quadrature optimization. Cited by: §2.
 Regret for expected improvement over the best-observed value and stopping condition. Cited by: §A.1.1, §1, §5.3, §5.3, §5.
 Quantile-based optimization of noisy computer experiments with tunable precision. Technometrics 55 (1), pp. 2–13. Cited by: §4.1.
 Early stopping - but when? In Neural Networks: Tricks of the Trade, G. B. Orr and K. Müller (Eds.), Lecture Notes in Computer Science, Vol. 1524, pp. 55–69. External Links: ISBN 3540653112 Cited by: §1, §2.
 Early stopping and nonparametric regression: an optimal datadependent stopping rule. J. Mach. Learn. Res. 15 (1), pp. 335–366. External Links: ISSN 15324435 Cited by: §1, §2.
 Gaussian processes for machine learning. Adaptive Computation and Machine Learning, MIT Press, Max-Planck-Gesellschaft, Biologische Kybernetik, Cambridge, MA, USA. Cited by: §4.1.
 Taking the human out of the loop: a review of Bayesian optimization. Proceedings of the IEEE 104 (1), pp. 148–175. Cited by: §4.1.
 Practical Bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2, NIPS’12, pp. 2951–2959. Cited by: §1.
 Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, ICML’10, Madison, WI, USA, pp. 1015–1022. External Links: ISBN 9781605589077 Cited by: §1, §4.2.1, §5.
 Freeze-thaw Bayesian optimization. ArXiv abs/1406.3896. Cited by: §2.
 OpenML. ACM SIGKDD Explorations Newsletter 15 (2), pp. 49–60. External Links: ISSN 19310153 Cited by: §3.
 Max-value entropy search for efficient Bayesian optimization. 34th International Conference on Machine Learning, ICML 2017 7 (NeurIPS), pp. 5530–5543. External Links: 1703.01968, ISBN 9781510855144 Cited by: §1, §4.1.
Appendix A Appendix
A.1. Experiment settings
A.1.1. BO setting
We used an internal BO implementation that uses expected improvement (EI) as the acquisition function and a GP with a Matérn-5/2 kernel. The GP hyperparameters include the output noise, a scalar mean value, a bandwidth for every input dimension, 2 input warping parameters, and a scalar covariance scale parameter. The closest open-source implementations are GPyOpt using an input-warped GP (https://github.com/SheffieldML/GPyOpt) and the AutoGluon BayesOpt searcher (https://github.com/awslabs/autogluon). We tested two methods to learn the GP hyperparameters in our experiments: either maximizing the type II likelihood or using slice sampling to draw posterior samples of the hyperparameters. In the latter case, we average across the hyperparameter samples (in our experiments we always use 10 samples) to compute EI and predictions. For slice sampling, we used 1 chain from which we draw 300 samples, with 250 as burn-in and a thinning of 5. We also fixed the maximum step-in and step-out to 200, and the scale parameter is fixed to 1.
We found that using slice sampling to learn the GP hyperparameters gives a more robust model fit than using maximum likelihood estimates. This is especially important for our baselines (Lorenz et al., 2016; Nguyen et al., 2017): with maximum likelihood, the EI and PI values can be very small ( to ) due to a bad model fit, triggering the stopping signal much earlier than it should. As a result, we only report experimental results using slice sampling throughout our paper.
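The sampling schedule and the averaging of EI over hyperparameter samples described above can be illustrated with a minimal sketch. It is not the internal implementation: the function names are hypothetical, EI is written for minimization under a Gaussian predictive distribution, and the per-sample predictive means and standard deviations are assumed to be supplied by the GP.

```python
import math

def thinned_indices(n_draws=300, burn_in=250, thinning=5):
    """Indices of the chain draws kept after discarding burn-in and thinning."""
    return list(range(burn_in, n_draws, thinning))

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mean, std, best):
    """EI for minimization, given a Gaussian prediction and the best value so far."""
    if std <= 0.0:
        # Degenerate prediction: the improvement is deterministic.
        return max(best - mean, 0.0)
    z = (best - mean) / std
    return (best - mean) * norm_cdf(z) + std * norm_pdf(z)

def averaged_ei(predictions, best):
    """Average EI over (mean, std) pairs, one pair per GP hyperparameter sample."""
    values = [expected_improvement(m, s, best) for m, s in predictions]
    return sum(values) / len(values)

# 300 draws with 250 burn-in and thinning 5 leave 10 hyperparameter samples.
kept = thinned_indices()
print(len(kept))
```

With these settings the chain yields exactly the 10 hyperparameter samples mentioned in the text, since (300 - 250) / 5 = 10.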
A.1.2. Search spaces of the 3 algorithms
Linear Model with SGD (LM), XGBoost (XGB) and RandomForest (RF) are based on scikit-learn implementations, and their search spaces are listed in Table 1.
A.1.3. Datasets
We list the datasets used in our experiments, together with their characteristics and sources, in Table 2. For each dataset, we first randomly draw 20% of the data as the test set. On the remaining data, we use 10-fold cross-validation for regression datasets and 10-fold stratified cross-validation for classification datasets. For the experiments without cross-validation, we fix the validation set to one of the 10 folds and use the remaining 9 folds as the training set.
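The splitting protocol above can be sketched as follows. This is a simplified illustration with hypothetical function names; it works on sample indices only and does not implement the stratification used for classification datasets.

```python
import random

def split_dataset(n_samples, test_fraction=0.2, n_folds=10, seed=0):
    """Hold out a random test set, then cut the rest into n_folds folds."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = int(round(test_fraction * n_samples))
    test, rest = indices[:n_test], indices[n_test:]
    # Fold i takes every n_folds-th remaining index, starting at position i.
    folds = [rest[i::n_folds] for i in range(n_folds)]
    return test, folds

def train_validation(folds, validation_fold=0):
    """Setting without cross-validation: one fixed fold validates, the rest train."""
    validation = folds[validation_fold]
    train = [i for j, fold in enumerate(folds) if j != validation_fold for i in fold]
    return train, validation

test, folds = split_dataset(1000)
train, validation = train_validation(folds)
print(len(test), len(train), len(validation))
```

For a dataset of 1000 samples this gives 200 test samples and ten folds of 80 samples each, so the setting without cross-validation trains on 720 samples and validates on 80.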
A.2. $k$-fold cross-validation and its variance
We study the variance of $k$-fold cross-validation and its relation to the choice of $k$. We select 50 hyperparameter configurations (the first 50 from BO) and allow cross-validation to reshuffle the data, so that we have 10 replicates for every choice of $k$. For every hyperparameter configuration, we first compute the standard deviation (std) of the cross-validation metrics for every replicate and then take the average over replicates. The resulting plots on 2 datasets and 3 algorithms are shown in Fig. 10. With higher $k$, the standard deviation of the cross-validation metrics tends to be larger.
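The quantity computed for each hyperparameter configuration can be written as a short sketch. The function name is hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption, since the text does not specify it.

```python
import statistics

def mean_cv_std(fold_metrics_per_replicate):
    """Average, over replicates, of the std of the per-fold metrics.

    fold_metrics_per_replicate holds one list of k fold metrics per
    reshuffled cross-validation replicate of a single configuration.
    """
    # Assumption: sample standard deviation (n - 1 in the denominator).
    stds = [statistics.stdev(metrics) for metrics in fold_metrics_per_replicate]
    return statistics.mean(stds)

# Two replicates of 2-fold cross-validation for one configuration.
print(mean_cv_std([[1.0, 3.0], [2.0, 2.0]]))
```

In the experiment described above, each configuration would contribute 10 replicates, each holding $k$ fold metrics.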
A.3. Heteroscedastic cross-validation variances
We study the variance of the cross-validation metrics and its relation to the hyperparameter configuration, using the hyperparameter evaluations collected in our BO experiments (without early stopping) on 6 example datasets. In Figure 11, the validation error and the standard deviation for each hyperparameter configuration are shown on the $x$-axis and $y$-axis, respectively. The Pearson correlation coefficient for each dataset is shown in the legend next to the dataset name. The average correlation coefficient for each algorithm is shown in the title next to the algorithm name.
Figure 11 shows that the variance of the cross-validation metrics depends on the hyperparameter configuration. The validation error and the standard deviation are mostly positively correlated, and negatively correlated in a few cases. For the same dataset, the correlation between the two can change significantly depending on the algorithm being used.
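For reference, the Pearson correlation coefficient between validation errors and their cross-validation standard deviations can be computed as in the following sketch. The function name and the input values are illustrative only and are not taken from the experiments.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative values: one entry per evaluated hyperparameter configuration.
validation_errors = [0.10, 0.15, 0.22, 0.30]
cv_stds = [0.010, 0.012, 0.020, 0.031]
print(pearson(validation_errors, cv_stds))
```

A positive coefficient corresponds to the common case reported above, where configurations with a higher validation error also have a larger standard deviation across folds.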