I Introduction
When applying machine learning, several high-level decisions have to be made: a learning algorithm (denoted base learner) needs to be selected, and different data preprocessing and feature selection techniques may be applied. Each comes with a set of parameters to be tuned, the hyperparameters. Automated machine learning (AutoML) addresses the automation of selecting base learners and preprocessors as well as tuning the associated hyperparameters. This allows non-experts to leverage machine learning. For example, [1] surveys AutoML approaches from the perspective of the biomedical application domain to guide “layman users”. Furthermore, AutoML makes the process of applying machine learning more efficient, e.g., by using automation to lower the workload of expert data scientists. Additionally, AutoML provides a structured approach to identifying well-performing base learner configurations, typically outperforming those tuned manually or by a grid search heuristic.
This work focuses on the AutoML step of tuning the hyperparameters of individual base learners. Most recent works on AutoML employ model-based approaches, such as Bayesian Optimization [2] (see [3], [4]) or, more recently, Probabilistic Matrix Factorization [5]. In addition to hyperparameter tuning, they address features such as identifying the best machine learning pipelines. Furthermore, [4] and [5] also draw on information about machine learning tasks solved in the past. We are particularly interested in whether Bayesian Optimization is better suited for hyperparameter tuning than, e.g., nature-inspired black-box optimization heuristics. We motivate this by the observation that the work presented in [5] discretizes continuous hyperparameters, which effectively turns them into categorical parameters and thereby gives up on the ambition to find the best-performing hyperparameter configuration. Yet, this approach outperforms [4], a widely used Bayesian Optimization approach that allows for continuous and categorical hyperparameters. That [5] (discretizing the space of machine learning pipeline configurations and thus being suboptimal) outperforms [4] (capable of dealing with continuous and categorical parameters) calls into question whether the popular model-based approaches, in particular variants of Bayesian Optimization, appropriately reflect base learner performance. While the AutoML approaches described in the literature rely on a variety of methods, ranging over Bayesian Optimization [2, 4], Probabilistic Matrix Factorization [5], and Evolutionary Algorithms [6], we are not aware of any direct comparison of these optimization approaches dedicated to hyperparameter tuning. Specifically, this paper investigates the performance of Differential Evolution [7], a representative evolutionary algorithm (i.e., a gradient-free black-box method), relative to Sequential Model-based Algorithm Configuration (SMAC) [3] as a reference for Bayesian Optimization. The paper focuses exclusively on cold start situations to gain insights into each method’s ability to tune hyperparameters.

II Related Work
Auto-sklearn [4] is probably the most prominent example of applying Bayesian Optimization (through the use of SMAC [3]) for the automated configuration of machine learning pipelines. It supports reusing knowledge about well-performing hyperparameter configurations when a given base learner is tested on similar datasets (denoted warm starting or meta-learning), ensembling, and data preprocessing. For base learner implementations, [4] leverages scikit-learn [8]. [4] studies individual base learners’ performances on specific datasets, but does not focus exclusively on tuning hyperparameters in its experiments. This work takes inspiration from [4] for Experiment 1 (base learner selection) and Experiment 2 (dataset and base learner selection).

A recent approach to modeling base learner performance applies a concept from recommender systems denoted Probabilistic Matrix Factorization [5]. It discretizes the space of machine learning pipeline configurations and, as is typical for recommender systems, establishes a matrix of tens of thousands of machine learning pipeline configurations’ performances on hundreds of datasets. Factorizing this matrix allows estimating the performance of yet-to-be-tested pipeline-dataset combinations. On a held-out set of datasets, [5] outperforms [4]. We do not include Probabilistic Matrix Factorization in this work, as recommender systems only work well in settings with previously collected correlation data, which is at odds with our focus on cold start hyperparameter tuning settings.
TPOT [6] uses Genetic Programming, an evolutionary algorithm, to optimize machine learning pipelines (i.e., data preprocessing, algorithm selection, and parameter tuning) for classification tasks, achieving competitive results. To keep the pipelines’ lengths within reasonable limits, it uses a fitness function that balances pipeline length with prediction performance. Similar to our work, TPOT relies on the DEAP framework [9], but uses a different evolutionary algorithm. While hyperparameter tuning is an aspect of TPOT, [6] does not attempt to isolate tuning performance and compare it against other hyperparameter tuning approaches.

BOHB [10] is an efficient combination of Bayesian Optimization with Hyperband [11]. In each BOHB iteration, a multi-armed bandit strategy (Hyperband) determines the number of hyperparameter configurations to evaluate and the associated computational budget. This way, configurations that are likely to perform poorly are stopped early, and promising configurations receive more computing resources. The identification of configurations at the beginning of each iteration relies on Bayesian Optimization. Instead of identifying ill-performing configurations early on, our work focuses on the hyperparameter tuning aspect. In particular, we study empirically whether alternative optimization heuristics such as evolutionary algorithms can outperform the widely used model-based hyperparameter tuning approaches.
This work differs from the referenced articles in that it attempts to isolate the hyperparameter tuning methods’ performances, e.g., by limiting CPU resources (a single CPU core) and imposing tight computational budgets (smaller time frames than in [4] and [5]). These tightly limited resources are vital to identifying the algorithmic advantages and drawbacks of different optimization approaches for hyperparameter tuning. Unlike, e.g., [5], we do not limit the time of individual hyperparameter configuration evaluations, which penalizes ill-performing but computationally expensive parameter choices. To the best of our knowledge, such scenarios have not been studied in the related literature.
III Methods
III-A Hyperparameter Tuning Definition
Given a validation set, the performance $f(\lambda)$ of a trained machine learning algorithm is a function of its continuous and categorical hyperparameters $\lambda \in \Lambda$ [10]. Hyperparameter tuning therefore corresponds to finding the best-performing algorithm configuration $\lambda^{*}$, i.e., $\lambda^{*} = \arg\min_{\lambda \in \Lambda} f(\lambda)$ (if using an error metric), or $\lambda^{*} = \arg\max_{\lambda \in \Lambda} f(\lambda)$ (if using an accuracy metric, e.g., for classification tasks).
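As a concrete illustration of this definition (a sketch with assumed names, not the paper's code), $f$ can be implemented as the cross-validated balanced error of a scikit-learn base learner, and tuning amounts to minimizing it over candidate configurations:

```python
# Illustrative sketch: the objective f maps a hyperparameter configuration
# to a cross-validated balanced error on a given dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

def f(n_neighbors):
    """Balanced classification error of kNN under 5-fold cross-validation."""
    model = KNeighborsClassifier(n_neighbors=n_neighbors)
    acc = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy").mean()
    return 1.0 - acc  # error metric: lower is better

# Tuning = searching for argmin_lambda f(lambda); here a tiny grid stands in
# for the optimizers compared in this paper.
candidates = [1, 3, 5, 7, 9]
best = min(candidates, key=f)
```

Both tuning methods studied in this paper only differ in how they propose the next $\lambda$ to evaluate; the objective itself stays the same.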
III-B Hyperparameter Tuning Methods
III-B1 Evolutionary Algorithms
This work applies evolutionary algorithms for hyperparameter tuning. These are a subset of the vast range of nature-inspired metaheuristics for optimization. Typically, candidate solutions are managed in a population. New solution candidates evolve from the current population by algorithmic operations inspired by the concepts of biological evolution: reproduction, mutation, recombination, and selection. Commonly, a problem-specific fitness function (e.g., the performance $f$) determines the quality of the solutions. Iterative application of the operations on the population results in its evolution. Evolutionary algorithms differ in how the algorithmic operations corresponding to the biological concepts are implemented.
This work uses Differential Evolution [7], a well-known and well-performing direction-based metaheuristic supporting real-valued as well as categorical hyperparameters [12], [13], as a representative evolutionary algorithm. Unlike traditional evolutionary algorithms, Differential Evolution perturbs the current-generation population members with the scaled differences of randomly selected and distinct population members; therefore, no separate probability distribution has to be used for generating the offspring’s genome [12]. Differential Evolution has only a few tunable parameters, as indicated below; hence, this work does not require extensive tuning of these parameters for the experimentation. Notably, variants of Differential Evolution have produced state-of-the-art results on a range of benchmark functions and real-world applications [13]. While several advanced versions of Differential Evolution exist [12], [13], this work focuses on the basic variant, using the implementation provided by the Python framework DEAP [9]. Should results indicate performance competitive with the standard Bayesian Optimization-based hyperparameter tuning approach, it will be a promising direction for future work to identify which of the variants is best suited for the hyperparameter tuning problem.

Differential Evolution is derivative-free and operates on a population of fixed size $NP$. Each population member of dimensionality $D$ represents a potential solution to the optimization problem. In this paper, $D$ equals the base learner’s number of tunable hyperparameters. We choose the population size depending on $D$ as $NP = m \cdot D$, where $m$ is a configurable parameter. For generating new candidate solutions, Differential Evolution selects four population members to operate on: the target, with which the offspring will compete, and three other randomly chosen input population members. First, Differential Evolution creates a mutant by modifying one of the three input members: it shifts the mutant’s value along each dimension by a fraction $F$ of an application-specific distance metric between the two remaining input members. Then, the crossover operation evolves the mutant into the offspring: each of the mutant’s dimensions’ values may be replaced by the target’s corresponding value, with a probability controlled by the crossover rate $CR$. The newly created offspring competes with the target to decide whether it replaces the target as a population member or is discarded. [14] provides detailed information on Differential Evolution and its operations.
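As an illustration of these operations (not the paper's DEAP-based implementation), a minimal sketch of the classic DE/rand/1/bin scheme on a toy real-valued objective might look as follows; all names and parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere(x):
    return float(np.sum(x ** 2))  # toy objective standing in for f

def differential_evolution(fobj, dim, pop_size, bounds, F=0.8, CR=0.9, gens=50):
    lo, hi = bounds
    pop = rng.uniform(lo, hi, size=(pop_size, dim))
    fitness = np.array([fobj(ind) for ind in pop])
    for _ in range(gens):
        for target in range(pop_size):
            # three distinct input members, all different from the target
            a, b, c = rng.choice([i for i in range(pop_size) if i != target],
                                 size=3, replace=False)
            mutant = pop[a] + F * (pop[b] - pop[c])           # mutation
            cross = rng.random(dim) < CR
            cross[rng.integers(dim)] = True                   # keep >=1 mutant gene
            offspring = np.where(cross, mutant, pop[target])  # crossover
            f_off = fobj(offspring)
            if f_off <= fitness[target]:                      # selection
                pop[target], fitness[target] = offspring, f_off
    return pop[np.argmin(fitness)], float(fitness.min())

best_x, best_f = differential_evolution(sphere, dim=3, pop_size=15, bounds=(-5, 5))
```

In the experiments, `fobj` would be the cross-validated balanced error of a base learner, and categorical dimensions would require the application-specific distance mentioned above.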
III-B2 Model-based
Bayesian Optimization is very popular for hyperparameter tuning. In each iteration $t$, it uses a probabilistic model $M_t$ of the objective function $f$ based on the observed data points $D_t = \{(\lambda_1, f(\lambda_1)), \ldots, (\lambda_t, f(\lambda_t))\}$. An acquisition function $a$ based on the current $M_t$ identifies the next configuration, typically by $\lambda_{t+1} = \arg\max_{\lambda} a(\lambda, M_t)$. The identified $\lambda_{t+1}$ represents the next hyperparameter configuration with which to evaluate (i.e., to train and test) the machine learning algorithm, yielding the observation $f(\lambda_{t+1})$. Note that observations of $f$ may be noisy, e.g., due to stochasticity of the learning algorithm. After evaluation, Bayesian Optimization updates its model $M_{t+1}$. A common acquisition function is the Expected Improvement criterion [2].
For image classification, [2] obtained state-of-the-art performance for tuning convolutional neural networks, utilizing Gaussian Processes to model $f$. Other approaches employing, e.g., the Tree Parzen Estimator technique [15] do not suffer some of the drawbacks of Gaussian Processes (e.g., cubic computational complexity in the number of samples). Similarly, the SMAC library [3] uses a tree-based approach to Bayesian Optimization. [10] provides additional details.

This paper relies on SMAC as a representative of Bayesian Optimization, as it is a core constituent of the widely used auto-sklearn library [4]. SMAC “iterates between building a model and gathering additional data” [3]. Given a base learner and a dataset, SMAC builds a model based on past performance observations of already trained and tested hyperparameter configurations. It optimizes the Expected Improvement criterion to identify the next promising hyperparameter configuration, which causes SMAC to also search regions of the hyperparameter space where its model exhibits high uncertainty. After trying the identified configuration with the base learner, SMAC updates the model again.
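To make the loop above concrete, the following hedged sketch performs one acquisition step with a Gaussian Process surrogate and the Expected Improvement criterion on a toy one-dimensional objective; names and parameter values are illustrative and not taken from the SMAC implementation:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(lam):
    return (lam - 0.3) ** 2  # stand-in for the (minimized) validation error

# Observations D_t gathered so far
lams = np.array([[0.0], [0.5], [1.0]])
errs = np.array([objective(l[0]) for l in lams])

# Probabilistic model M_t of the objective f
gp = GaussianProcessRegressor(normalize_y=True).fit(lams, errs)

def expected_improvement(candidates, best_err, xi=0.01):
    """EI acquisition for minimization: expected drop below the incumbent."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (best_err - mu - xi) / sigma
    return (best_err - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# lambda_{t+1} = argmax of the acquisition over a candidate grid
grid = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
lam_next = float(grid[np.argmax(expected_improvement(grid, errs.min()))][0])
# lam_next is evaluated next, and the model is refit with the new observation
```

SMAC replaces the Gaussian Process with a tree-based model, which is what lets it handle categorical hyperparameters and larger sample counts.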
III-C Experimental Setup and Evaluation
This work focuses on cold start situations to isolate the aspect of tuning a given base learner’s hyperparameters. Consequently, the experiments do not cover other beneficial aspects such as meta-learning, ensembling, and preprocessing steps. We denote the application of a hyperparameter tuning method to a preselected base learner on a specific dataset as an experiment run. To study the hyperparameter tuning performances of Differential Evolution and SMAC, we assign equal computational resources (a single CPU core, fixed wall-clock time budget) to each experiment run. For assessing Differential Evolution’s ability to tune hyperparameters relative to SMAC, we compare both methods’ performance per experiment run by relative ranking. Similar to [4], we account for class occurrence imbalances using the balanced classification error metric, i.e., the average class-wise classification error. The run with the better evaluation result (based on five-fold cross-validation) counts as a win for the corresponding tuning method. In each experiment, we create five data folds using the scikit-learn library [8]. The folds serve as input to both tuning methods’ experiment runs. To break ties in reported results, we award a win to the method requiring less wall-clock time to reach the best result. This work experiments with six base learners provided by [8]: k-Nearest Neighbors (kNN), linear and kernel SVM, AdaBoost, Random Forest, and Multi-Layer Perceptron (MLP).
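The pairwise ranking with time-based tie-breaking described above can be sketched as follows (an illustrative helper with assumed field names, not the paper's evaluation code):

```python
def score_runs(runs):
    """Count wins per tuner from paired experiment runs.

    Each run is (de_error, de_seconds_to_best, smac_error, smac_seconds_to_best);
    the lower balanced error wins, and ties go to the faster method.
    """
    wins = {"DE": 0, "SMAC": 0, "tie": 0}
    for de_err, de_t, smac_err, smac_t in runs:
        if de_err < smac_err:
            wins["DE"] += 1
        elif smac_err < de_err:
            wins["SMAC"] += 1
        elif de_t < smac_t:            # tie-break: first to reach the best result
            wins["DE"] += 1
        elif smac_t < de_t:
            wins["SMAC"] += 1
        else:
            wins["tie"] += 1           # unresolved tie (e.g., no result reported)
    return wins

# Hypothetical paired results: (DE error, DE time, SMAC error, SMAC time)
example = [(0.10, 50, 0.12, 40), (0.20, 30, 0.20, 60), (0.30, 10, 0.25, 90)]
```

In this example, DE wins the first run outright, wins the second via the time tie-break, and loses the third on error.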
This study documents two sets of classification experiments. In both, only the overall experiment run wall-clock time budget limits execution time, i.e., we do not limit the time taken for each base learner invocation (training and testing). We select the Differential Evolution parameters (the population multiplier $m$, scale factor $F$, and crossover rate $CR$) after a brief hyperparameter sweep.
Experiment 1. This experiment executes a single experiment run of both tuning methods for each of the base learners on 49 small datasets (less than 10,000 samples) as identified by [5] (www.OpenML.org dataset IDs: {23, 30, 36, 48, 285, 679, 683, 722, 732, 741, 752, 770, 773, 795, 799, 812, 821, 859, 862, 873, 894, 906, 908, 911, 912, 913, 932, 943, 971, 976, 995, 1020, 1038, 1071, 1100, 1115, 1126, 1151, 1154, 1164, 1412, 1452, 1471, 1488, 1500, 1535, 1600, 4135, 40475}). As these datasets are small, we assign a strict wall-clock time budget of one hour to each experiment run (one-third of the approximate time budget of [5]).
Experiment 2. The second experiment leverages the efforts of [4], which identified representative datasets from a range of application domains to demonstrate the robustness and general applicability of AutoML. For that, [4] clustered 140 openly available binary and multiclass classification datasets, covering a diverse range of applications such as text classification, digit and letter recognition, gene sequence and RNA classification, and cancer detection in tissue samples, into 13 clusters based on the datasets’ meta-features. Each cluster is represented by one dataset. Of these 13 representative datasets, we select the ten (www.OpenML.org dataset IDs: {46, 184, 293, 389, 554, 772, 917, 1049, 1120, 1128}) not requiring preprocessing such as imputation of missing data, and apply all six base learners to each. All experiment runs receive a time budget of 12 hours (half of the budget in [4]). We repeat this experiment for each base learner and dataset five times per tuning method. Per repetition, five data folds are created and presented to both tuning methods so that they face the same learning challenge.

Generalization of results. In total, the experiments cover 59 openly available datasets from www.openml.org, as identified in [4] and [5]. To help generalize the findings in this study, we treat each pairwise experiment run as a Bernoulli random variable and apply statistical bootstrapping (10,000 bootstrap samples, 95% confidence level) to infer confidence intervals for the probability of Differential Evolution beating SMAC. As the dataset selection criteria differ between Experiments 1 and 2, we discuss the experiments’ results separately.
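The described bootstrap procedure can be sketched as follows (illustrative code; the example outcome counts reuse the Experiment 1 totals after tie-breaking reported later):

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_win_ci(outcomes, n_boot=10_000, level=0.95):
    """Percentile bootstrap CI for the probability that DE beats SMAC.

    `outcomes` is a sequence of Bernoulli results (1 = DE win, 0 = SMAC win).
    """
    outcomes = np.asarray(outcomes)
    samples = rng.choice(outcomes, size=(n_boot, len(outcomes)), replace=True)
    means = samples.mean(axis=1)                 # win rate per bootstrap sample
    alpha = (1.0 - level) / 2.0
    return float(np.quantile(means, alpha)), float(np.quantile(means, 1 - alpha))

# e.g., 170 DE wins vs. 124 SMAC wins (Experiment 1 after tie-breaking)
outcomes = [1] * 170 + [0] * 124
lo, hi = bootstrap_win_ci(outcomes)
```

A confidence interval lying entirely above 50% would indicate a statistically significant advantage for Differential Evolution.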
IV Experiment Results
IV-A Experiment 1
Table I documents the results of running both tuning algorithms for individual base learners on the 49 small datasets, each with a one-hour wall-clock budget on a single CPU core.
When ranking the tuning methods by the best mean five-fold balanced error alone, Differential Evolution (denoted DE in Tables I and II) scores 19.4% more wins than SMAC (129 wins to 108). 19.4% of the experiment runs result in a tie, where both tuners achieve the same error. Breaking these ties (based on which tuning method reached its best result first) shows the benefit of Differential Evolution even more clearly: it scores 37.1% more wins than SMAC.
From a per-learner perspective, the picture is diverse. Differential Evolution clearly outperforms SMAC for Random Forest and AdaBoost. For both SVM algorithms, SMAC wins on more than half of the datasets, even after tie-breaking. For kNN, both tuning methods achieve equal performance after breaking the ties. For MLP, there is only a single tie, and Differential Evolution wins in the majority of cases.
Table I: Experiment 1 win counts per base learner; values in parentheses give the counts after tie-breaking.

          kNN      Linear SVM  Kernel SVM  AdaBoost  Random Forest  MLP      sum
DE        7 (25)   6 (18)      16 (20)     38 (40)   34 (39)        28 (28)  129 (170)
SMAC      17 (24)  27 (31)     28 (29)     8 (9)     8 (10)         20 (21)  108 (124)
tie       25 (0)   16 (0)      5 (0)       3 (0)     7 (0)          1 (0)    57 (0)
IV-B Experiment 2
IV-B1 Relative Performances
Table II lists the counts of experiment runs that Differential Evolution wins or ties against SMAC. When aggregated over all learners and datasets (bold entries), Differential Evolution outperforms SMAC by scoring 14.5% more wins. Similar to Experiment 1, breaking the ties benefits Differential Evolution: after tie-breaking, it scores 22.7% more wins than SMAC.
From a per-learner perspective, i.e., aggregated over all datasets, Table II indicates that for kNN and linear SVM both methods perform similarly, even after breaking the ties. Kernel SVM is slightly favorable to Differential Evolution, with 26.3% more wins than SMAC before tie-breaking and 31.6% more after. For Random Forest, Differential Evolution scores significantly more wins (30% more before tie-breaking, 50% more after), while for MLP, SMAC does (38.1% more wins than Differential Evolution, no ties). Differential Evolution most clearly outperforms SMAC for AdaBoost, achieving more than twice the number of SMAC’s wins. When excluding AdaBoost from the experiment results, Differential Evolution (100 wins) is on par with SMAC (101 wins) before tie-breaking. After tie-breaking, Differential Evolution scores 9.8% more wins than SMAC (123 wins to 112).
From a per-dataset perspective, Table II indicates that Differential Evolution performs as well as or better than SMAC for most datasets. Both before and after tie-breaking, SMAC scores more wins on datasets 293, 389, 554, and 1120. Conversely, Differential Evolution achieves a lower mean balanced cross-validation error than SMAC on six of the ten datasets. Even when excluding AdaBoost, the base learner on which Differential Evolution most clearly outperformed SMAC, SMAC still wins on only four of the ten datasets.
IV-B2 Learning Curves: Progress of Hyperparameter Tuning
Figures 1 and 2 visualize balanced accuracy over time for selected learners and datasets. Crosses represent tested individuals in the Differential Evolution population. Prominent horizontal spacing between crosses indicates the time needed to complete the training of a base learner with a given hyperparameter configuration. The figures also illustrate SMAC’s hyperparameter tuning progress. As we used a SMAC implementation that logs timing information only when a new best configuration is identified, the figures do not provide information about the training time of the hyperparameter configurations SMAC chose between two best configurations.
Figure 1 shows steady learning progress for both hyperparameter tuners on dataset 1128 for the kNN classifier. The base learner processes this dataset quickly; therefore, no horizontal white spaces are observable in the Differential Evolution plot. The figure depicts a tie between the tuners.
Figure 2 visualizes a case in which Differential Evolution outperforms SMAC when tuning AdaBoost on dataset 554. The plot exhibits prominent horizontal spacing for Differential Evolution. The few plotted crosses show that not even the initial population could be evaluated entirely, i.e., evolution did not start before the budget ran out; a problem that arises with large datasets.
V Discussion
V-A Experiment Results
Experiment run statistics. Typical evolutionary algorithms do not rely on a model of the process to be optimized, but rather on random chance and algorithmic equivalents of biological evolution. Compared to model-based methods such as Bayesian Optimization in the context of hyperparameter tuning, this may or may not be a drawback. As typical evolutionary methods do not use gradients in their optimization progress, they usually have to repeat objective function evaluations more often than gradient-based methods. However, Bayesian Optimization might be misled if its model is ill-suited for the specific base learner whose hyperparameters are tuned. In this respect, the experiments show interesting results: Differential Evolution performs at least as well as SMAC, the Bayesian Optimization method leveraged in auto-sklearn [4], for a variety of datasets and across a range of machine learning algorithms. In fact, Differential Evolution scores more wins than SMAC in both Experiment 1 (19.4% more wins without tie-breaking, 37.1% with tie-breaking) and Experiment 2 (14.5% and 22.7%, respectively).
Figures 3-5 exhibit consistent behavior for each tuning method, which we also confirmed for other base learners and datasets (not shown). This suggests that five experiment runs per tuning method, base learner, and dataset are sufficiently informative for analyzing the Experiment 2 results.
In Experiment 2, MLP is the only base learner on which SMAC significantly outperforms Differential Evolution. Both tuning algorithms perform similarly on two learners (kNN and linear SVM), and Differential Evolution outperforms SMAC on three learners (kernel SVM, Random Forest, AdaBoost), with the most definite results for AdaBoost. According to the experimental results of [4], AdaBoost performs well compared to other learning algorithms on most datasets. Thus, Differential Evolution’s strong performance for AdaBoost in both experiments suggests using it rather than SMAC for tuning AdaBoost’s hyperparameters. Note that the per-learner tendencies differ between Experiments 1 and 2 for kNN, linear SVM, and kernel SVM: without tie-breaking, SMAC wins more often in Experiment 1, but not in Experiment 2. The results for MLP also reverse between the experiments: Differential Evolution wins more often in Experiment 1, but SMAC does in Experiment 2. Only AdaBoost and Random Forest are winners for Differential Evolution in both Tables I and II.
Tie-breaking usually favors Differential Evolution, i.e., it is faster to reach the best accuracy achieved by both tuning methods. SMAC outperforms Differential Evolution in the early stages, in particular on the bigger datasets, if the budget is too short for the evolution phase to make significant tuning progress or even to start at all.
Table II shows that tie-breaking does not resolve 15 ties for datasets 293 and 554. This only occurs when a given SMAC experiment run and its Differential Evolution counterpart do not report a single evaluation result within the time limit. It indicates that the 12-hour time budget is challenging for these datasets, in particular when tuning the hyperparameters of kNN and kernel SVM (Table II). Datasets 293 and 554 are the largest in Experiment 2 (measured as the number of samples times the number of features per sample, see [4]). Figure 2 illustrates the learning progress of AdaBoost on dataset 554. The prominent horizontal spacing in Differential Evolution’s learning curve confirms that large datasets require substantial computation time to train and test a single hyperparameter configuration; note the difference to Figure 1 on the smaller dataset 1128.
Inferential statistics. Figures 6, 7, and 8 illustrate the 95% confidence intervals of the Bernoulli trial probability of Differential Evolution outperforming SMAC. Breaking ties generally favors Differential Evolution, and there is a noticeable upward shift of many confidence intervals after tie-breaking. Most per-algorithm and per-dataset confidence intervals in Figures 6-8 cross the 50% reference line, i.e., a success probability strictly above or below 50% is not significant at the 95% level. These confidence intervals therefore provide no statistical justification to prefer either tuning method. However, several of them tend to favor Differential Evolution: larger shares reside above the reference line than below. With additional experiment runs in the future, the confidence intervals should shrink, and success or failure probabilities may become statistically significant. Figure 8 shows that, with high confidence, the Differential Evolution results are negative for datasets 293 and 389, as the intervals’ upper bounds stay below the 50% reference line. Dataset 1120, and to a much lesser extent dataset 554, also tends to favor SMAC, as a larger portion of the confidence interval resides below the reference line. When aggregating all base learners of Experiment 1 after tie-breaking, Figure 6 suggests a statistically significant result in favor of Differential Evolution. After tie-breaking, Figures 6 and 7 suggest that Differential Evolution outperforms SMAC with statistical significance for tuning AdaBoost in both experiments. For Random Forest, the results also tend to favor Differential Evolution, although less strongly. It is striking that both ensemble-based methods (AdaBoost, Random Forest) are favorable to Differential Evolution, while the results are less clear or negative for the other learning algorithms. However, we have not yet been able to identify the algorithmic reason for this behavior, and it remains a research question what determines the tuners’ performance when tuning different learners’ hyperparameters, and why.
For the total aggregate of Experiment 2, Figure 7 shows that even after tie-breaking the confidence interval crosses the reference line; its lower bound reaches 49%. Statistical t-tests confirm that Differential Evolution’s total aggregate success chance being larger than 50% in Experiment 2 is not significant at the 95% confidence level, although it is close, and it is significant at the 90% level (results omitted for brevity). Overall, the statistical results encourage future work. We anticipate that several of the advanced Differential Evolution variants in [12], [13] will improve on our experiment results and tip the scale against Bayesian Optimization.

By design, the experimental setup presents a challenge for both hyperparameter tuning methods in order to investigate their tuning performance relative to one another. In this setup, both tuning methods may suffer from large datasets, limited CPU resources, and tight time budgets. Note that here the iterative approach of Bayesian Optimization is a strength compared to Differential Evolution. As Bayesian Optimization collects new samples, it updates its probabilistic model. Even if the time budget is small, as long as it evaluates more than a single hyperparameter configuration, successive iterations should sample better and better configurations. On the other hand, if Differential Evolution evaluates the same number of hyperparameter configurations, its evolution has not yet started as long as that number is smaller than or equal to the population size $NP$. In that situation, no improvements of the hyperparameter configurations are to be expected, and performance is a matter of chance. A possible way to improve its relative performance on large datasets such as 293 or 554, for which it does not finish evaluating the initial population, could be to reorder the initial population members by increasing (expected) computational cost. This way, Differential Evolution can evaluate at least more configuration-dataset samples within the budget. In addition, reducing the population size $NP$ when facing very tight time budgets could help Differential Evolution reach the evolution operations earlier. However, that reduces the exploration potential of the method.
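The proposed reordering could be sketched as follows; the cost model, configuration fields, and numbers are hypothetical illustrations, not taken from the experiments:

```python
def order_by_expected_cost(population, cost_model):
    """Evaluate cheap configurations first so more samples fit in the budget."""
    return sorted(population, key=cost_model)

# Hypothetical cost model: e.g., for Random Forest, training cost grows with
# the number of trees and the maximum tree depth.
def rf_cost(config):
    return config["n_estimators"] * config["max_depth"]

population = [
    {"n_estimators": 500, "max_depth": 20},
    {"n_estimators": 50, "max_depth": 5},
    {"n_estimators": 200, "max_depth": 10},
]
ordered = order_by_expected_cost(population, rf_cost)
```

Under a budget that expires mid-population, this ordering trades the random evaluation order for a guaranteed larger number of completed evaluations.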
Tighter time budgets. Figures 9 and 10 illustrate how the Experiment 2 results would have looked had different shares of the original 12-hour budget been applied. As more computing resources become available, Differential Evolution improves in performance relative to SMAC. Without tie-breaking, it crosses SMAC’s score at a budget of approximately 30% (4 hours) and consistently remains above it thereafter. The figures confirm that breaking ties is favorable to Differential Evolution: with tie-breaking, a budget of less than 10% (1.2 hours) suffices for Differential Evolution to first score more wins than SMAC, with a short period of reversal at a budget of 15%. For larger budgets, Differential Evolution consistently achieves more wins than SMAC, and the gap widens as the budget increases.
V-B Limitations
This work assesses the suitability of the selected optimization approaches for hyperparameter tuning. Therefore, it focuses exclusively on cold start situations and does not consider other relevant aspects such as meta-learning, ensembling, and data preprocessing. Future work will extend to these aspects.
The experimental setup limits the execution of experiment runs to a single CPU core. This reduces the potential impact of how well the used software libraries and frameworks can exploit parallelism. The achievable performance gains also depend on the base learner’s capability to use parallel computing resources. For example, Random Forest is an ensemble method that parallelizes over the number of trees, whereas AdaBoost is sequential due to the nature of boosting. Future work will study the impact of parallelism on hyperparameter tuning performance for different methods and base learners.
VI Conclusion and Future Work
This paper compares Differential Evolution (a well-known representative of evolutionary algorithms) and SMAC (Bayesian Optimization) for tuning the hyperparameters of six selected base learners. In two experiments with limited computational resources (a single CPU core, strict wall-clock time budgets), Differential Evolution outperforms SMAC when considering the final balanced classification error. In Experiment 1, the optimization algorithms tune the hyperparameters of the base learners on 49 different small datasets for one hour each. In Experiment 2, both optimization algorithms tune the base learners’ hyperparameters for 12 hours each on ten different representative datasets. In the former experiment, Differential Evolution scores 19% more wins than SMAC; in the latter, 15%. The results also show that Differential Evolution benefits from breaking ties in a ‘first-to-report-best-final-result’ fashion: in Experiment 1, Differential Evolution wins 37% more often than SMAC; in Experiment 2, 23%. Experiment 2 also shows that SMAC performs better than Differential Evolution only when the budget is tiny. That occurs when Differential Evolution enters the evolution phase late or cannot even finish evaluating the initial population. Differential Evolution is particularly strong when tuning AdaBoost. Already with the basic version of Differential Evolution, positive results can be reported with statistical significance for some of the datasets and base learners. This suggests considerable potential for improvement when using one of the advanced versions in [12], [13].
We see several possibilities to extend this work. First, future work should study whether more recent evolutionary algorithms, such as the variants of Differential Evolution listed in [12], [13], can improve hyperparameter tuning results. A second avenue is to integrate meta-learning [16] by choosing the initial population's parameters accordingly; then, Probabilistic Matrix Factorization approaches such as [5] will also have to be considered for comparison. Third, an investigation is required to understand why Differential Evolution performs better than SMAC when tuning some of the base learners, while the results are less clear or negative for the other learning algorithms. Fourth, we intend to investigate whether hybrid methods such as [10] could benefit from adopting concepts of evolutionary algorithms. Finally, future work will extend to entire machine learning pipelines, i.e., also supporting preprocessing steps and ensembling as in [4], and study the implications of parallel execution.
References
 [1] G. Luo, “A review of automatic selection methods for machine learning algorithms and hyperparameter values,” Network Modeling Analysis in Health Informatics and Bioinformatics, vol. 5, no. 1, p. 18, 2016.
 [2] J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian optimization of machine learning algorithms,” in Advances in Neural Information Processing Systems, 2012, pp. 2951–2959.
 [3] F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Sequential model-based optimization for general algorithm configuration,” in Proceedings of the conference on Learning and Intelligent OptimizatioN (LION 5), Jan. 2011, pp. 507–523.
 [4] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter, “Efficient and robust automated machine learning,” in Advances in Neural Information Processing Systems, 2015, pp. 2962–2970.
 [5] N. Fusi, R. Sheth, and M. Elibol, “Probabilistic matrix factorization for automated machine learning,” in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds. Curran Associates, Inc., 2018, pp. 3352–3361. [Online]. Available: http://papers.nips.cc/paper/7595-probabilistic-matrix-factorization-for-automated-machine-learning.pdf
 [6] R. S. Olson and J. H. Moore, “TPOT: A tree-based pipeline optimization tool for automating machine learning,” in Workshop on Automatic Machine Learning, 2016, pp. 66–74.
 [7] R. Storn and K. Price, “Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces,” Journal of Global Optimization, vol. 11, no. 4, pp. 341–359, Dec 1997. [Online]. Available: https://doi.org/10.1023/A:1008202821328
 [8] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2825–2830, 2011.

 [9] F.-M. De Rainville, F.-A. Fortin, M.-A. Gardner, M. Parizeau, and C. Gagné, “DEAP: A Python framework for evolutionary algorithms,” in Proceedings of the 14th Annual Conference Companion on Genetic and Evolutionary Computation, ser. GECCO ’12. New York, NY, USA: ACM, 2012, pp. 85–92. [Online]. Available: http://doi.acm.org/10.1145/2330784.2330799
 [10] S. Falkner, A. Klein, and F. Hutter, “BOHB: Robust and efficient hyperparameter optimization at scale,” in International Conference on Machine Learning, 2018, pp. 1436–1445.
 [11] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar, “Hyperband: A novel bandit-based approach to hyperparameter optimization,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6765–6816, 2017.
 [12] S. Das and P. N. Suganthan, “Differential evolution: A survey of the state-of-the-art,” IEEE Transactions on Evolutionary Computation, vol. 15, no. 1, pp. 4–31, Feb 2011.
 [13] R. D. Al-Dabbagh, F. Neri, N. Idris, and M. S. Baba, “Algorithmic design issues in adaptive differential evolution schemes: Review and taxonomy,” Swarm and Evolutionary Computation, vol. 43, pp. 284–311, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2210650217305837
 [14] X. Yu and M. Gen, Introduction to Evolutionary Algorithms. Springer Publishing Company, 2012.
 [15] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for hyperparameter optimization,” in Advances in neural information processing systems, 2011, pp. 2546–2554.

 [16] M. Lindauer and F. Hutter, “Warm-starting of model-based algorithm configuration,” in AAAI Conference on Artificial Intelligence, 2018. [Online]. Available: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17235