Although the Algorithm Selection Problem (ASP, ) has been introduced more than four decades ago, there only exist few works (e.g., [2, 3]), which perform algorithm selection in the field of continuous optimization. Independent of the underlying domain, the goal of the ASP can be described as follows: given a set of optimization algorithms , often denoted algorithm portfolio, and a set of problem instances , one wants to find a model that selects the best algorithm from the portfolio for an unseen problem instance . Albeit there already exists a plethora of optimization algorithms – even when only considering single-objective, continuous optimization problems – none of them can be considered to be superior to all the other ones across all optimization problems. Hence, it is very desirable to find a sophisticated selection mechanism, which automatically picks the portfolio’s best solver for a given problem.
Within other optimization domains, such as the well-known Travelling Salesperson Problem, feature-based algorithm selectors have already shown their capability of outperforming the respective state-of-the-art optimization algorithm(s) by combining machine learning techniques and problem dependent features [4, 5]. As schematized in Figure 1, we now transfer the respective idea of using instance-specific features to single-objective continuous optimization problems based on Exploratory Landscape Analysis (ELA, ) for leveraging solver complementarity of a well-designed algorithm portfolio.
As we show within this work, the integration of ELA results in strong performance improvements (averaged across the entire benchmark) over any of the portfolio’s single solvers. More precisely, our selector requires only half of the resources needed by the portfolio’s single best solver. Hence, our model strongly reduces the gap towards the idealistic – and thus, from a realistic point of view unreachable – virtual best solver.
A more detailed overview of Exploratory Landscape Analysis, as well as an introduction into flacco – an R toolbox for computing a variety of such landscape features enhanced by a graphical user interface – is given in Section II. In Section III, we give more insights into the COCO platform and afterwards describe our experimental setup, including the generation of the considered algorithm portfolio, in Section IV. An analysis of our found algorithm selection models is given in Section V and Section VI concludes our work.
Ii Exploratory Landscape Analysis
While problem-dependent (landscape) features can in general be computed for any optimization problem (e.g., [7, 8, 9, 10, 11]), we will only consider single-objective, continuous optimization problems within this work. For this domain, Mersmann et al. introduced a sophisticated approach for characterizing a problem’s landscape by means of numerical feature values and called it “Exploratory Landscape Analysis” . Within their work, they designed a total of 50 numerical measures and grouped them into six categories of so-called “low-level” features: convexity, curvature, level set, local search, meta model and y-distribution features. These numbers were then used to characterize eight so-called “high-level” properties, such as the degree of multimodality, the separability, the basin size homogeneity or the number of plateaus. However, given that these properties (a) require expert knowledge and consequently can not be computed automatically, and (b) are categorical and thus, make it impossible to e.g., distinguish problems by their degree of multimodality, the introduction of the low-level features can be seen as a major step towards automatically computable landscape features and hence automated algorithm selection.
Already in the years before the term ELA was introduced, researchers have tried to characterize a problem’s landscape by numerical values: Jones and Forrest assessed a problem’s difficulty by means of a fitness distance correlation , Lunacek and Whitley introduced a dispersion metric , Malan and Engelbrecht quantified the landscape’s ruggedness  and Müller and Sbalzarini performed fitness-distance analyses .
However, after Bischl et al.  have shown the potential of landscape analysis by using ELA features for algorithm selection, a manifold of new landscape features has been introduced. Abell et al. used hill climbing characteristics , Muñoz et al. measured a landscape’s information content  and Morgan and Gallagher analyzed the problems with the help of length scale features . More recently, Malan et al. characterized constraint optimization problems , and Shirakawa and Nagao introduced an entire bag of local landscape features .
In other works, research groups, which also include this paper’s authors, have designed features based on a discretization of the continuous search space into a grid of cells , and successfully employed the nearest better clustering approach for distinguishing funnel-shaped landscapes with a global structure from problems, whose local optima are aligned in a more random manner . This particularly facilitates the decision of the class of algorithms which is most suited for the problem at hand. We also showed that even low budgets of observations (
being the problem dimensionality) – i.e., a sample size that is close to the size of an evolutionary algorithm’s initial population – is sufficient for such a distinction. In consequence, the evaluation of this initial sample, which is required for the landscape features, would come without any additional costs, given that the evolutionary algorithm would have to evaluate those points anyways.
In the previous subsection, we have provided an overview of numerous landscape features. Unfortunately, those features were usually – if at all – implemented in different programming languages, such as Python , Matlab  or R , making it extremely complicated to use all of them within a single experiment. This obstacle has been solved (for R-users) with the development of flacco , an R-package for feature-based landscape-analysis of continuous and constrained optimization problems. The package (currently) provides a collection of more than 300 landscape features (including the majority of the ones from above), distributed across a total of 17 different feature sets. In addition, the package comes with several visualization techniques, which should help to facilitate the understanding of the underlying problem landscapes . One can either use the package’s stable release from CRAN111https://cran.r-project.org/package=flacco or its developmental version from GitHub222https://github.com/kerschke/flacco. Note that the latter also provides further information on the usage of flacco, as well as a link to its online-tutorial333http://kerschke.github.io/flacco-tutorial/site/.
Being aware of possible obstacles for non-R-users, we recently developed a web-hosted and hence, platform-independent, graphical user interface (GUI)444http://www.flacco.shinyapps.io/flacco of flacco . A screenshot of that GUI is displayed in Figure 2. The GUI provides a slim, and thus user-friendly, version of flacco. In its left part (highlighted in grey), the user needs to provide information on the function that should be optimized – either by selecting one of the single-objective optimization problems available in the R-package smoof , configuring one of the BBOB-functions, manually defining a function, or by uploading an evaluated set of points. On the right side, one then can either compute a specific feature set or visualize certain aspects of the optimization problem and/or its features.
Iii Exploratory Data Analysis of COCO
Instead of executing the optimization algorithms ourselves, we use the results from COCO555http://coco.gforge.inria.fr/ , which is a platform for COmparing Continuous Optimization algorithms on the Black-Box Optimization Benchmark (BBOB [32, 33]). The platform provides a collection of the performance results of 129 optimization algorithms, which have been submitted to the BBOB workshops at the GECCO conference666http://sig.sigevo.org/index.html/tiki-index.php?page=GECCOs of the years 2009, 2010, 2012, 2013 and 2015. Using the results published on this platform comes with the advantage that over the years basically all important classes of optimization algorithms, including current state-of-the art optimizers, have been submitted. Thus, experiments are of representative character in a comprehensive setting.
Iii-a General Setup of the COCO-Platform
The competition settings were quite similar across the years: per dimension and function , the participants had to submit the results for a total of 15 problem instances. As shown below, only five instances (IIDs 1 to 5) were part of each competition, while the remaining ten instances changed per year:
2009: IIDs 1 to 5 (with 3 replications each)
2010: IIDs 1 to 15 (1 replication each)
2012: IIDs 1 to 5, 21 to 30 (1 replication each)
2013: IIDs 1 to 5, 31 to 40 (1 replication each)
2015: IIDs 1 to 5, 41 to 50 (1 replication each)
For each pair of problem instance and optimization algorithm , the submitted data contains a log of the performed number of function evaluations and the corresponding achieved fitness value, enabling an a posteriori evaluation of the solver’s performance. More precisely, this data allows to decide whether the solver was successful777 is , if the solver finds a solution , whose fitness value lies within ; otherwise . to approximate the (known) global optimum of instance up to a precision value and also, how many function evaluations were performed until the solver (un)successfully terminated. These values were then used to compute the solver’s Expected Runtime (ERT) :
Although the aforementioned setup comes with some drawbacks, we will still use it as it has been well-established in the community and thus allows for wide comparability. Nevertheless, we would like to address its pitfalls as well: (a) Given the strong variability across the instances’ objective values, it is not straightforward to use a fixed absolute precision value for comparing the solvers across all instances. Instead, it might be more reasonable to use a relative threshold. (b) Computing the ERT based on different instances – even if they belong to the same BBOB problem – is very questionable as the instances can have completely different landscapes: if the original instance is a highly multimodal function, the transformed instance888Per definition, instances are identical up to rotation, scaling and shifts . could – if at all – consist of only a few peaks. Therefore, we strongly encourage to ask for several solver runs on the same instance in future setups, i.e., perform (at least five to ten) replicated runs, and then evaluate the performance results per instance rather than per function allowing for ERT computations on function and instance level.
Iii-B Preprocessing COCO’s Solver Performances
We will use the performance results of the 129 optimization algorithms available at COCO999An overview of all submitted optimization algorithms along with their descriptions can be found at http://coco.gforge.inria.fr/doku.php?id=algorithms.. However, in order to work with a valid data base, we first performed some sanity checks on the platform’s data before combining it in a joint data base.
While the submitted results of all 29 (2009) and 23 (2010) solvers of the first two competitions passed these checks, in the following years, only 22 of 28 (2012), 16 of 31 (2013) and 13 of 18 (2015) submissions did so. The invalid submissions partially used settings of the previous years. However, in order to use the most general portfolio of solvers, we only considered the submissions for IIDs 1 to 5 with only one run per instance, as this is the only set of instances that was used across all five BBOB competitions. Fortunately, the performances of all 129 solvers could be considered for our experiments, because even the problematic submissions from above had valid data for the first five instances.
Iv Experimental Setup
Iv-a Algorithm Performance Data
For each of the 129 solvers we computed the ERT per tuple of problem dimension , BBOB problem respectively function and problem instance (if multiple runs exist, we only considered the first run), resulting in a total of observations. The ERTs were computed for a precision threshold of , because smaller values led to too many unsuccessful runs. Even for this chosen precision, only approximately of all (considered) runs terminated successfully.
Iv-B Instance Feature Data
Each of the ELA feature sets was computed using a so-called improved latin hypercube design  consisting of observations, which were sampled across the decision space, i.e., . The feature sets were then computed using the R-package flacco  for all four problem dimensions, 24 BBOB problems and five problem instances that were used by the performance data (see Section IV-A). For each of these 480 observations, we calculated the six classical ELA feature sets (convexity, curvature, levelset, local search, meta-model and y-distribution) , as well as the basic, (cell mapping) angle101010 The cell mapping angle features were computed using three blocks per dimension, as larger values would result in too many empty cells due to the “curse of dimensionality”.
The cell mapping angle features were computed using three blocks per dimension, as larger values would result in too many empty cells due to the “curse of dimensionality”., dispersion , information content , nearest better clustering  and principal component features, resulting in a total of 102 features per problem instance.
Although being conscious of the resulting information loss, we aggregate each feature across the five problem instances (per BBOB problem) via the median of the respective feature values, in order to map our feature data to the 96 observations (24 problems, four dimensions) of the performance data.
Iv-C Constructing the Algorithm Portfolio
For meaningfully addressing the algorithm selection task, the choice of the underlying algorithm portfolio is crucial. Ideally, the considered set should be as small and as complimentary as possible and should include state-of-the art optimizers. For this purpose, we ranked the solvers per considered BBOB problem based on ERT performance. We then constructed four solver sets (one per dimension), each of them containing the solvers that ranked within the “Top 3” of at least one of the 24 functions of the respective dimension. Based on these four solver sets – consisting of between 37 and 41 solvers each – a portfolio of 12 solvers was constructed by only considering optimizers that belonged to each of the four sets.
The 12 optimization algorithms from the found portfolio can be grouped into four categories and are summarized below.
Iv-C1 Deterministic Optimization Algorithms (2)
The two solvers of this category are variants of the Brent-STEP algorithm111111The Brent-STEP algorithm itself accelerates the global line search method STEP (“select the easiest point”)  by using Brent’s method . . It performs axis-parallel searches and chooses the next iteration’s search dimension either using a round-robin (BSrr, 
) or a quadratic interpolation strategy (BSqi, ).
Iv-C2 Multi-Level Approaches (5)
The origin of most solvers belonging to this category is the multi level single linkage method (MLSL, [39, 40]). It is a stochastic, multistart, global optimizer that relies on random sampling and local searches. Aside from MLSL itself, some of its variants also belong to our portfolio: an interior-point version for constrained nonlinear problems (fmincon, ), a quasi-Newton version, which approximates the Hessian using BFGS  (fminunc, ), and a hybrid variant whose most important improvements are related to its sampling phase (HMLSL, ). The final optimizer belonging to this group is the multilevel coordinate search (MCS, ), which splits the search space into smaller boxes – each containing a known observation – and then starts local searches from promising boxes.
Iv-C3 Variants of the CMA-ES (4)
In 2001, Hansen introduced one of the most popular evolution strategies: the Covariance Matrix Adaption Evolution Strategy (CMA-ES)  with cumulative step-size adaptation (CMA-CSA, ). It led to a plethora of variants , including the following three solvers from our portfolio: (1) IPOP400D , a restart version of the CMA-ES with an increasing population size (IPOP-CMA-ES, ) and a maximum of function evaluations. (2) A hybrid CMA (HCMA, ), which combines a bi-population self-adaptive surrogate-assisted CMA-ES121212A BIPOP-CMA-ES  is a multistart CMA-ES with equal budgets for two interlaced restart strategies: one with an increasing population size and one with varying small population sizes. (BIPOP-aACM-ES-k, ), STEP  and NEWUOA  to benefit from surrogate models and line searches simultaneously. (3) A sequential, model-based algorithm configuration (SMAC, ) procedure applied to the BBOB problems (SMAC-BBOB, ). It uses Gaussian processes (GP, ) to model the the expected improvement function and then performs one run of DIRECT  (with evaluations) and ten runs of the classical CMA-ES  (with evaluations) on the expected improvement function.
Iv-C4 Others (1)
). It is a commercial, heuristic, multistart algorithm that was designed to find the global optima of smooth constrained nonlinear programs (NLPs) and mixed integer nonlinear programs (MINLPs). The algorithm uses theOptQuest Callable Library (OCL, ) to generate candidate starting points for a local NLP solver.
Iv-D Machine Learning Algorithms
We considered three classes of supervised learning strategies for training our algorithm selection models: (1) Aclassification approach, which simply tries to predict the best-performing optimizer131313Ties are broken via random uniform sampling among the tied solvers. and hence, completely ignores the magnitude of performance differences between the best and the remaining portfolio solvers. (2) A regression approach, which trains a separate model for the performances of each optimizer and afterwards predicts the solver with the best predicted performance. (3) In addition to these well-known strategies, we also considered the so-called pairwise regression, which led to promising results in other works (e.g., [4, 5]). In contrast to modelling the performances straight-forward (as in (2)), it models the performance differences for each solver pair and afterwards predicts the solver whose predicted performance difference was the highest, compared to all other solvers.
The algorithm selectors were trained in R  using the R-package mlr . For each class of the supervised learning approaches, we considered recursive partitioning and regression trees (rpart, 
), kernel-based support vector machines (ksvm, 
), random forests (randomForest, 
) and extreme gradient boosting (xgboost, 
). Additionally, we also tried multivariate adaptive regression splines (mars, ) in case of the (paired) regression approaches.
Note that the SVM’s inverse kernel width sigma
was the only hyperparameter that was (automatically) configured – using thesigest function from the R-package kernlab . All other hyperparameters were used in their default settings: the SVMs used Gaussian radial basis kernels and the random forests were constructed using 500 trees, whose split points were sampled random uniformly of (classification) or (regression / paired regression) features with being the data set’s number of (ELA) features.
Iv-E Feature Selection Strategies
The algorithm selectors are initially trained using all 102 features. However, using all of the features simultaneously likely causes lots of noise and/or redundancy, which could lead to poorly performing algorithm selectors. Furthermore, some of the feature sets, namely, the convexity, curvature and local search features from the classical ELA features 
, require additional function evaluations on top of the costs for the initial design. In order to overcome these obstacles, we used the following four feature selection strategies.
A greedy forward-backward selection, which starts with an empty feature set and iteratively alternates between greedily adding and/or removing features as long as this results in an improvement of the model’s performance.
This strategy works analogously to the previous one, but starts with the full set of all 102 features and afterwards alternates between removing and adding features.
A genetic algorithm with population size 10 and 5 offsprings per generation. Here, the selected features are represented as a 102-dimensional bit string, where a value of 1 at positionimplies that the -th feature is selected. The GA runs for a maximum of 100 generations and selects the features by performing random bit flips – with a (default) mutation rate of and a crossover rate of .
A modified version of the previous GA, using a tenfold of offsprings (50 instead of 5) per generation.
Iv-F Performance Assessment
In lieu of using the ERT itself, we will use the relative ERT (relERT) for our experiments. While the former strongly biases the algorithm selectors towards multimodal and higher-dimensional problems due to much larger amounts of used function evaluations, the relERTs, which are also used within the BBOB competitions, allow a fair comparison of the solvers across the problems and dimensions by normalizing each solver’s ERT with the ERT of the best solver for the respective BBOB problem (of that dimension). Instead of scaling each performance with the respective best ERT of all 129 solvers, we used the best performance from the 12 solvers of the considered portfolio as this is our set of interest in this study.
As some solvers did not even solve a single instance from some of the BBOB functions, the respective ERTs and relERTs were not defined. We thus imputed the missing relERTs using the well-known PAR10 score (e.g.,). That is, each of the problematic values is replaced with a penalty value () that is the tenfold of the highest valid relative ERT; for all other values, the respective (finite) relERTs are used.
For each of the supervised learning approaches (see Section IV-D), the algorithm selectors are evaluated using the mean relative ERT
, which averages the relERTs of the predictions (including the costs for the initial design) on the corresponding test data, i.e., a subset of the BBOB problems. In order to obtain more realistic and reliable estimates of the selectors’ performances, they were assessed using leave-one-(function)-out cross-validation. That is, per algorithm selector we train a total of 96 submodels. Each of them uses only 95 of the 96 BBOB problems for training and the remaining one for testing. Note that each problem was used exactly once as test data. Consequently, within each iteration/fold of the leave-one-(function)-out cross-validation, exactly one problem (i.e., the respective test data) is completely kept out of the modeling phase and only used for assessing the respective submodel’s performance. The average of the resulting 96 relERTs is then used as our algorithm selector’s performance.
Following common practices in algorithm selection (e.g., ), we compare the performances of our algorithm selectors with two baselines: the virtual best solver (VBS) and the single best solver (SBS). The virtual best solver, sometimes also called oracle or perfect selector, provides a lower bound for the selectors as it shows the performance that one could (theoretically) achieve on the data, when always selecting the best performing algorithm per instance. Given that the relERT has a lower bound of 1 and that at least one solver achieves that perfect performance per instance, the VBS has to be 1 per definition. Nevertheless, it is quite obvious that algorithm selectors usually do not reach such an idealistic performance given their usually imperfect selections and/or the influence of the costs for computing the landscape features.
The second baseline, the SBS, represents the (aggregated) performance of the (single) best solver from the portfolio. Consequently, this baseline is much more important than the VBS, because an algorithm selector is only useful if its performance (including feature costs) is better than the SBS. Note that in principle, the SBS could either be determined on the entire data set or, for (leave-one-out) cross-validation, be computed per fold. However, within our experiments, both approaches result in identical performances.
V-a Analyzing the Algorithm Portfolio
|Relative Expected Runtime of the 12 Solvers from the Considered Algorithm Portfolio and the 2 Presented Algorithm Selection-Models|
|Group||CSA||HCMA||400D||BBOB||# 1||# 2|
|2||F1 - F5||1.2||1.3||54.8||11.0||11.8||3.7||14.6||18.4||5.8||15.5||17.0||22 014.9||16.6||20.3|
|F6 - F9||18 516.7||9 708.2||7.4||18.6||19.2||5.8||1.7||5.7||11.3||24.2||1.5||27 518.6||3.1||3.5|
|F10 - F14||7 649.2||7 481.5||8.3||1.0||62.7||6.3||1.0||10.7||322.7||1.0||4.9||29 353.2||4.7||4.0|
|F15 - F19||7 406.6||14 710.3||14.7||7 392.0||7 367.7||25.3||8.1||15.5||7.7||7 391.7||7 351.2||29 354.8||26.2||10.1|
|F20 - F24||84.8||14 768.5||7 351.9||4.1||14.5||44.9||3.9||14 679.3||11.4||2.1||2.7||22 014.6||42.5||3.0|
|all||6 240.7||9 318.4||1 549.1||1 546.5||1 556.7||17.7||6.0||3 068.4||74.3||1 547.9||1 536.9||25 990.1||19.3||8.4|
|3||F1 - F5||1.3||1.3||7 367.9||85.2||132.1||356.1||6.8||14 686.6||45.9||55.9||7 347.6||22 015.1||58.4||94.9|
|F6 - F9||331.2||9 527.4||4.7||38.5||9 173.7||4.5||1.9||6.5||31.4||9 173.4||2.5||36 690.3||3.3||39.9|
|F10 - F14||29 356.3||14 712.1||8.9||1.0||4.1||5.0||1.0||12.3||8 132.7||1.0||9.3||29 353.4||4.8||3.6|
|F15 - F19||14 698.2||22 026.2||1.6||14 701.2||14 699.5||2.6||11.4||7 339.4||7 346.9||14 700.0||14 686.2||36 690.3||2.8||7.1|
|F20 - F24||14 741.8||14 758.7||7 389.4||7 339.6||14 677.4||66.8||2.3||22 015.1||7 342.4||7 339.8||1.9||22 014.8||67.0||3.4|
|all||12 304.7||12 316.7||3 077.4||4 616.2||7 677.5||90.4||4.8||9 178.9||4 769.4||6 132.4||4 593.1||29 047.1||28.3||29.4|
|5||F1 - F5||1.4||1.4||7 533.6||14 678.4||14 679.2||12.0||17.5||14 688.7||14 678.1||14 678.5||14 678.0||22 015.1||22.7||22.9|
|F6 - F9||27 597.4||36 690.3||5.6||9 173.5||9 173.8||3.9||2.4||4.9||28.8||9 173.4||9 173.5||36 690.3||4.8||4.8|
|F10 - F14||22 032.8||29 360.3||8.9||1.0||11.9||4.2||1.0||13.6||22 019.2||1.0||10.7||36 690.3||5.2||5.2|
|F15 - F19||36 690.3||36 690.3||3.1||36 690.3||36 690.3||4.3||7 346.1||29 352.5||36 690.3||36 690.3||29 352.5||36 690.3||4.4||4.4|
|F20 - F24||22 053.6||22 050.8||7 400.0||14 678.9||22 014.9||7.7||7 339.8||22 017.4||14 681.0||22 015.0||14 676.8||22 014.9||7.8||7.8|
|all||21 428.3||24 469.8||3 114.6||15 289.0||16 819.9||6.5||3 063.8||13 765.8||18 352.4||16 817.4||13 761.8||30 575.6||9.1||9.2|
|10||F1 - F5||1.6||1.6||14 691.0||14 679.9||14 682.7||2.7||7 365.5||14 698.8||14 680.0||14 679.9||14 678.3||22 015.7||16.3||16.3|
|F6 - F9||36 690.3||27 563.9||4.3||9 173.4||9 173.8||2.2||4.1||9 181.9||9 188.1||9 173.4||9 173.9||36 690.3||2.7||2.7|
|F10 - F14||29 359.3||29 359.8||8.4||1.1||15.4||2.8||1.1||7 352.5||22 018.7||1.1||12.0||36 690.3||3.7||3.7|
|F15 - F19||36 690.3||36 690.3||1.7||36 690.3||36 690.3||2.0||22 028.5||29 352.5||36 690.3||36 690.3||36 690.3||36 690.3||2.1||2.1|
|F20 - F24||36 690.3||29 367.0||14 685.9||22 015.2||22 015.0||23.6||14 677.1||29 352.8||22 018.9||22 014.6||22 014.9||36 690.3||23.7||23.7|
|all||27 519.5||24 472.9||6 123.0||16 817.8||16 821.3||6.9||9 182.4||18 354.6||21 408.0||16 817.6||16 819.7||33 633.1||10.0||10.0|
|all||F1 - F5||1.4||1.4||7 411.8||7 363.6||7 376.5||93.6||1 851.1||11 023.1||7 352.4||7 357.4||9 180.2||22 015.2||28.5||38.6|
|F6 - F9||20 783.9||20 872.4||5.5||4 601.0||6 885.1||4.1||2.5||2 299.8||2 314.9||6 886.1||4 587.9||34 397.4||3.5||12.7|
|F10 - F14||22 099.4||20 228.4||8.7||1.0||23.5||4.6||1.0||1 847.3||13 123.3||1.0||9.3||33 021.8||4.6||4.1|
|F15 - F19||23 871.3||27 529.3||5.2||23 868.5||23 861.9||8.6||7 348.5||16 515.0||20 183.8||23 868.1||22 020.0||34 856.4||8.9||5.9|
|F20 - F24||18 392.6||20 236.3||9 206.8||11 009.4||14 680.5||35.8||5 505.8||22 016.2||11 013.4||12 842.9||9 174.1||25 683.7||35.3||9.5|
|all||16 873.3||17 644.5||3 466.0||9 567.4||10 718.9||30.4||3 064.3||11 091.9||11 151.0||10 328.8||9 177.9||29 811.5||16.7||14.2|
Here, we examine the constructed portfolio to get more insights into the data provided by COCO. In a first step, we compare the best performances of the portfolio solvers to the best performances of all 129 solvers. For 51 of the 96 considered BBOB functions over the dimensions, the portfolio’s VBS is identical to the VBS of all COCO-solvers.
The violin plot on the right side of Figure 3
– which can be understood as a boxplot, whose shape represents a kernel density estimation of the underlying observations – indicates that the ratios between the best ERTs from the portfolio and all COCO-solvers (per BBOB problem) is usually close to one. More precisely, only ten BBOB problems, depicted by black dots in the boxplot of Figure3, have a ratio greater than two, and only three of them – the 2D-version of Schaffers F7 Function (FID 17, ), as well as the 5D- and 10D-versions of the Lunacek bi-Rastrigin Function (FID 24, ) – have ratios worse than three. Consequently, we can conclude that if we find a well-performing algorithm selector, its suggested optimization algorithm likely requires less than twice the number of function evaluations compared to the best algorithm from the entire COCO platform.
As mentioned in Section IV-F, we had to impute some of the performances, because not all optimizers solved at least one instance per BBOB problem. More precisely, HCMA was the only one to find at least one valid solution for all 96 problems, followed by HMLSL (88), CMA-CSA (87), OQNLP (72), fmincon (71), MLSL (69), fminunc (68), IPOP400D and MCS (67), BSqi (52), BSrr (50) and SMAC-BBOB (18). Hence, the mean relERT of HCMA (30.37) was by far the lowest of all considered solvers, making this solver the clear SBS.
However, as shown in Table I, HCMA is not superior to the remaining solvers across all problems. Independent of the problem dimensions, the Brent-STEP variants BSqi and BSrr outperform the other optimizers on the separable problems (FIDs 1 to 5). Similarly, the multilevel methods MLSL, fmincon and HMLSL are superior on the unimodal functions with high conditioning (FIDs 10 to 14), and the multimodal problems with adequate global structure (FIDs 15 to 19) are solved fastest by MCS (in 2D) and CMA-CSA (3D to 10D). The remaining problems, i.e., functions with low or moderate conditioning (FIDs 6 to 9), as well as multimodal functions with a weak global structure (FIDs 20 to 24) do not show such obvious patterns. Nevertheless one can see that HCMA, HMLSL, OQNLP and sometimes also MLSL perform quite promising on these function groups.
The fact that HCMA often is inferior to the other 11 solvers clearly indicates that the data contains sufficient potential for improving over the SBS – e.g., with a reasonable algorithm selector. This finding is supported by the small amount of points located exactly on the diagonal of the scatterplots in Figure 4, which depict the ERT-pairs of the two baseline algorithms for all 24 BBOB problems and per problem dimension. Noticeably, though not very surprising, the VBS and SBS always have higher ERTs on the multimodal problems (FIDs 15 to 24) than on the unimodal problems (FIDs 1 to 14). Also, the shape of the cloud of observations indicates that the problems’ complexity grows with its dimension: while the 2D-observations are mainly clustered in the lower left corner (indicating rather easy problems), the cloud stretches along the entire diagonal for higher-dimensional problems. Furthermore it is obvious that two problems, FIDs 1 (Sphere, indicated by ) and 5 (Linear Slope, ), are consistently solved very fast (independent of its dimensionality), whereas FID 24 (Lunacek Bi-Rastrigin, ) is always the most difficult problem for the two baseline algorithms.
Looking at the performances of HCMA across all 96 problems, one problem stands out: the three-dimensional version of the Büche-Rastrigin function (FID 4). On this (usually rather simple) function, HCMA has a relERT of 1765.9, compared to at most 51.8 on the non-3D-versions of this problem and 249.1 as the worst relERT of all remaining 95 BBOB problems. We therefore had a closer look at the submitted data of HCMA across FID 4 and for all dimensions. The respective data is summarized in Table II. Comparing the results of the 20 optimization runs with each other, three of them seem to be clear outliers. However, it remains unclear whether they were caused by the algorithm itself – it could for instance be stuck in a local optimum without terminating – or whether the outliers were caused by some technical issue on the platform.
|Dim||IID||Gap to||# Function||Success?||ERT of VBS||ERT of SBS|
|2||9.99e-01||570 925||FALSE||219.6||387 784.5|
|2||8.09e-04||2 248||TRUE||486.2||25 197.0|
|2||6.72e-03||4 495||TRUE||1 067.8||4 286.6|
V-B Analyzing the Best Algorithm Selector
After gaining more insights into the underlying data, we use the performance and feature data for training a total of 70 algorithm selectors: 14 machine learning algorithms (see Section IV-D), each of them using either one of the four feature selection strategies described in Section IV-E or no feature selection at all. The classification-based support vector machine, which employed a greedy forward-backward feature selection strategy (sffs), achieved the best performance. While the single best solver (HCMA) had a mean relERT of 30.37, the algorithm selector (denoted as Model 1) had a mean relERT of 16.67 (including feature costs) and 13.60 (excluding feature costs). Therefore, the selector is able to pick an optimizer from the portfolio, which solves the respective problem on average twice as fast.
Looking at the scatterplots shown in the top row of Figure 5, one can see that – with the exception of the 2D Rastrigin function (FID 3, +) – Model 1 predicts on all instances an optimizer that performs at least as good as HCMA. Admittedly, this comparison is not quite fair, as the ERTs shown within the respective row do not consider the costs for computing the landscape features. Hence, a more realistic comparison of the selector and the SBS is shown in the second row of Figure 5. The fact that Model 1 performs worse than HCMA on some of the problems is negligible as the losses only occur on rather simple – and thus, quickly solvable – problems.
The allocated budget for computing the landscape features was , i.e., between 100 and 500 function evaluations depending on the problem dimension. On the other hand, the ERT of the portfolio’s VBS lies between 4.4 and 7 371 411. These numbers (a) explain, why Model 1 performs poor on rather simple problems, and (b) support the thesis that the costs for computing the landscape features only have a small impact on the total costs when solving rather complex problems. However, one should be aware of the fact that the number of required function evaluations for the initial design is in a range of the common size of initial algorithm populations so that in practice no additional feature costs would occur.
These findings are also visible in the upper row of Figure 6, which compares the relERTs of HCMA and Model 1 (on a log-scale) across all 96 BBOB problems (distinguished by FID and dimension). For better visibility of the performance differences, the area between the relERTs of the SBS (depicted by ) and Model 1 () is highlighted in grey. One major contribution of our selector obviously is to successfully avoid using HCMA on the three-dimensional version of FID 4. Moreover, one can see that the SBS mainly dominates our selector on the separable BBOB problems (FIDs 1 to 5), whereas our selector achieved equal or better performances on the majority of the remaining problems.
Eight features resulted from the feature selection approach and were included in Model 1: three features from the y-distribution feature set (the skewness, kurtosis and number of peaks of a kernel-density estimation of the problems’ objective values,), one levelset feature (the ratio of mean misclassification errors when using a LDA and MDA, ), two information content features (the maximum information content and the settling sensitivity, 
), one cell mapping feature (the standard deviation of the distances between each cell’s center and worst observation,) and one of the basic features (the best fitness value within the sample).
V-C Improving the Currently Best Algorithm Selector
Although Model 1 already performs better than the SBS, we tried to improve the previous model by making use of results from previous works. More precisely, we start with the set of features that was employed by Model 1 and try to improve that model by adding the meta-model and nearest better clustering features. The rational behind this approach is that (a) these feature sets – as shown in  – are very useful for characterizing a landscape’s global structure, and (b) the feature set of Model 1 was derived from a greedy approach.
Based on the combination of the three feature sets from above, additional feature selections were performed. Two of the new selectors, namely the ones constructed using the greedy sfbs-approach (mean relERT: 14.27) and the -GA (14.24) – performed better than Model 1. The latter model is from here on denoted as “Model 2”.
As displayed in Table I, Model 2 not only has the best overall performance, but also performs best on the multimodal functions with a weak global structure (FIDs 20 to 24). This effect is also visible in the bottom row of Figure 5, where the problems of the respective BBOB group (colored in pink) are located either in the left half of the respective plots (for 2D and 3D) or on the diagonal itself (5D and 10D).
Also, Figures 5 and 6 reveal that Model 2 differs more often from the SBS than Model 1. This finding is also supported by Table III, which summarizes how often each of the solvers was predicted and how often it was the best solver. Although HCMA is only in roughly 15% of the problems (14 of 96) the best choice, Model 1 selects it in approximately 80% and Model 2 in 46%. Interestingly, the two models only predict three and four different solvers, respectively.
The different behavior of the two selectors is likely caused by the differences in the used features. While three of the features are identical for both selectors – the previously described angle and levelset features, as well as the skewness of the objective values – the remaining six features of Model 2 are meta-model and nearest better clustering (NBC) features. The two meta model features measure (1) the smallest absolute, non-intercept coefficient of a linear model (without interactions) and (2) the adjusted model fit () of a quadratic model (without interactions) . The remaining four NBC features measure (1) the ratio of the standard deviations of the distances among the nearest neighbours and the nearest better neighbours, (2) the ratio of the respective arithmetic means, (3) the correlation between the distances of the nearest neighbours and the nearest better neighbours, and (4) the so-called indegree, i.e., the correlation between fitness value and the count of observations to whom the current observation is the nearest better neighbour .
|True Best Solver||Predicted Solver (Model 1 / Model 2)|
|BSqi||6||3 / 2||3 / 2||0 / 2||0 / 0|
|BSrr||6||1 / 2||5 / 4||0 / 0||0 / 0|
|CMA-CSA||7||0 / 1||7 / 3||0 / 3||0 / 0|
|fmincon||12||0 / 4||8 / 4||0 / 1||4 / 3|
|fminunc||6||1 / 2||4 / 3||0 / 0||1 / 1|
|HCMA||14||0 / 3||14 / 11||0 / 0||0 / 0|
|HMLSL||11||3 / 3||7 / 4||0 / 0||1 / 4|
|IPOP400D||7||0 / 0||7 / 3||0 / 3||0 / 1|
|MCS||4||2 / 1||2 / 0||0 / 3||0 / 0|
|MLSL||12||4 / 2||8 / 6||0 / 2||0 / 2|
|OQNLP||6||0 / 1||6 / 2||0 / 2||0 / 1|
|SMAC-BBOB||5||0 / 0||5 / 2||0 / 1||0 / 2|
|96||14 / 21||76 / 44||0 / 17||6 / 14|
We showed that sophisticated machine learning techniques combined with informative exploratory landscape features form a powerful tool for automatically constructing algorithm selection models for unseen optimization problems. In our setting of continuous black-box optimization, we were able to reduce the mean relative ERT of the single best solver of our portfolio by half relying only on a quite small set of ELA features and function evaluations. A specific R-package enhanced by a user-friendly graphical user interface allows for transferring the presented methodology to other (continuous) optimization scenarios differing from the BBOB setting.
Of course, the quality and usability of such models heavily relies on the quality of the algorithm benchmark as well as on the representativeness of the included optimization problems. While we considered the well-known and commonly accepted BBOB workshop setting, we are aware of possible shortcomings and will extend our research in terms of including other comprehensive benchmarks once available and practical applications.
Finally, we work on extending the methodological approach to constrained and multi-objective optimization problems in order to increase the applicability to real-world scenarios.
The authors acknowledge support from the European Research Center for Information Systems141414https://www.ercis.org/ (ERCIS).
-  J. R. Rice, “The Algorithm Selection Problem,” Advances in Computers, vol. 15, pp. 65 – 118, 1975.
B. Bischl, O. Mersmann, H. Trautmann, and M. Preuss, “Algorithm Selection
Based on Exploratory Landscape Analysis and Cost-Sensitive Learning,” in
Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation (GECCO). ACM, 2012, pp. 313 – 320.
-  M. A. Muñoz Acosta, Y. Sun, M. Kirley, and S. K. Halgamuge, “Algorithm Selection for Black-Box Continuous Optimization Problems: A Survey on Methods and Challenges,” Information Sciences, vol. 317, pp. 224 – 245, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0020025515003680
-  L. Kotthoff, P. Kerschke, H. H. Hoos, and H. Trautmann, “Improving the State of the Art in Inexact TSP Solving Using Per-Instance Algorithm Selection,” in Proceedings of 9th International Conference on Learning and Intelligent Optimization (LION), ser. Lecture Notes in Computer Science, C. Dhaenens, L. Jourdan, and M.-E. Marmion, Eds., vol. 8994. Springer, January 2015, pp. 202 – 217.
-  P. Kerschke, L. Kotthoff, J. Bossek, H. H. Hoos, and H. Trautmann, “Leveraging TSP Solver Complementarity through Machine Learning,” Evolutionary Computation Journal, (accepted).
-  O. Mersmann, B. Bischl, H. Trautmann, M. Preuss, C. Weihs, and G. Rudolph, “Exploratory Landscape Analysis,” in Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation (GECCO). ACM, 2011, pp. 829 – 836.
O. Mersmann, B. Bischl, H. Trautmann, M. Wagner, J. Bossek, and F. Neumann,
“A Novel Feature-Based Approach to Characterize Algorithm Performance for
the Traveling Salesperson Problem,”
Annals of Mathematics and Artificial Intelligence, vol. 69, pp. 151 – 182, October 2013.
-  F. Hutter, L. Xu, H. H. Hoos, and K. Leyton-Brown, “Algorithm Runtime Prediction: Methods & Evaluation,” Artificial Intelligence Journal, vol. 206, pp. 79 – 111, 2014.
-  G. Ochoa, S. Verel, F. Daolio, and M. Tomassini, “Local Optima Networks: A New Model of Combinatorial Fitness Landscapes,” in Recent Advances in the Theory and Application of Fitness Landscapes, ser. Emergence, Complexity and Computation. Springer, 2014, pp. 233 – 262.
-  J. Pihera and N. Musliu, “Application of Machine Learning to Algorithm Selection for TSP,” in Proceedings of the IEEE 26th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE press, 2014.
-  F. Daolio, A. Liefooghe, S. Verel, H. Aguirre, and K. Tanaka, “Problem features vs. algorithm performance on rugged multi-objective combinatorial fitness landscapes,” Evolutionary Computation, 2016.
-  T. Jones and S. Forrest, “Fitness Distance Correlation as a Measure of Problem Difficulty for Genetic Algorithms,” in Proceedings of the 6th International Conference on Genetic Algorithms (ICGA). Morgan Kaufmann Publishers Inc., 1995, pp. 184 – 192. [Online]. Available: http://dl.acm.org/citation.cfm?id=657929
-  M. Lunacek and L. D. Whitley, “The Dispersion Metric and the CMA Evolution Strategy,” in Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation (GECCO). ACM, 2006, pp. 477 – 484.
-  K. M. Malan and A. P. Engelbrecht, “Quantifying Ruggedness of Continuous Landscapes Using Entropy,” in Proceedings of the IEEE Congress on Evolutionary Computation (CEC). IEEE, 2009, pp. 1440 – 1447.
-  C. L. Müller and I. F. Sbalzarini, “Global Characterization of the CEC 2005 Fitness Landscapes Using Fitness-Distance Analysis,” in Proceedings of the European Conference on the Applications of Evolutionary Computation (EvoApplications), ser. Lecture Notes in Computer Science. Springer, April 2011, pp. 294 – 303.
-  T. Abell, Y. Malitsky, and K. Tierney, “Features for Exploiting Black-Box Optimization Problem Structure,” in Proceedings of 7th International Conference on Learning and Intelligent Optimization (LION), G. Nicosia and P. Pardalos, Eds. Springer, January 2013, pp. 30 – 36.
-  M. A. Muñoz Acosta, M. Kirley, and S. K. Halgamuge, “Exploratory Landscape Analysis of Continuous Space Optimization Problems Using Information Content,” IEEE Transactions on Evolutionary Computation, vol. 19, no. 1, pp. 74 – 87, 2015.
-  R. Morgan and M. Gallagher, “Analysing and Characterising Optimization Problems Using Length Scale,” Soft Computing, pp. 1 – 18, 2015.
-  K. M. Malan, J. F. Oberholzer, and A. P. Engelbrecht, “Characterising Constrained Continuous Optimisation Problems,” in Proceedings of the IEEE Congress on Evolutionary Computation (CEC). IEEE, 2015, pp. 1351–1358.
-  S. Shirakawa and T. Nagao, “Bag of local landscape features for fitness landscape analysis,” Soft Computing, vol. 20, no. 10, pp. 3787–3802, 2016.
P. Kerschke, M. Preuss, C. I. Hernández Castellanos, O. Schütze, J.-Q.
Sun, C. Grimme, G. Rudolph, B. Bischl, and H. Trautmann, “Cell Mapping
Techniques for Exploratory Landscape Analysis,” in
EVOLVE – A Bridge between Probability, Set Oriented Numerics, and Evolutionary Computation V. Springer, 2014, pp. 115 – 131.
-  P. Kerschke, M. Preuss, S. Wessing, and H. Trautmann, “Detecting Funnel Structures by Means of Exploratory Landscape Analysis,” in Proceedings of the 17th Annual Conference on Genetic and Evolutionary Computation (GECCO). ACM, 2015, pp. 265 – 272.
-  ——, “Low-Budget Exploratory Landscape Analysis on Multiple Peaks Models,” in Proceedings of the 18th Annual Conference on Genetic and Evolutionary Computation (GECCO). ACM, 2016.
-  G. VanRossum and The Python Development Team, The Python Language Reference – Release 3.5.0. Python Software Foundation, 2015.
-  MATLAB, Version 8.2.0 (R2013b). Natick, MA, USA: The MathWorks Inc., 2013.
-  R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2017. [Online]. Available: http://www.R-project.org/
-  P. Kerschke, flacco: Feature-Based Landscape Analysis of Continuous and Constrained Optimization Problems, 2017, R-package version 1.6. [Online]. Available: https://cran.r-project.org/package=flacco
-  P. Kerschke and H. Trautmann, “The R-Package FLACCO for Exploratory Landscape Analysis with Applications to Multi-Objective Optimization Problems,” in Proceedings of the IEEE Congress on Evolutionary Computation (CEC). IEEE, 2016.
-  C. Hanster and P. Kerschke, “flaccogui: Exploratory Landscape Analysis for Everyone,” in Proceedings of the 19th Annual Conference on Genetic and Evolutionary Computation (GECCO) Companion. ACM, 2017.
-  J. Bossek, smoof: Single and Multi-Objective Optimization Test Functions, 2016, R-package version 1.4. [Online]. Available: https://github.com/jakobbossek/smoof
-  N. Hansen, A. Auger, O. Mersmann, T. Tušar, and D. Brockhoff, “COCO: A Platform for Comparing Continuous Optimizers in a Black-Box Setting,” CoRR, vol. abs/1603.08785, 2016. [Online]. Available: http://arxiv.org/abs/1603.08785
-  N. Hansen, A. Auger, S. Finck, and R. Ros, “Real-Parameter Black-Box Optimization Benchmarking 2009: Experimental Setup,” INRIA, Tech. Rep. RR-6828, 2009. [Online]. Available: https://hal.inria.fr/inria-00362649v3/document
-  S. Finck, N. Hansen, R. Ros, and A. Auger, “Real-Parameter Black-Box Optimization Benchmarking 2010: Presentation of the Noiseless Functions,” University of Applied Science Vorarlberg, Dornbirn, Austria, Tech. Rep., 2010. [Online]. Available: http://coco.lri.fr/downloads/download15.03/bbobdocfunctions.pdf
-  B. Beachkofski and R. Grandhi, “Improved Distributed Hypercube Sampling,” in Proceedings of the 43rd AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, AIAA paper 2002-1274. American Institute of Aeronautics and Astronautics, 2002.
-  S. Swarzberg, G. Seront, and H. Bersini, “S.T.E.P.: The Easiest Way to Optimize a Function,” in Proceedings of the First IEEE Congress on Evolutionary Computation (CEC). IEEE, June 1994, pp. 519 – 524.
-  R. P. Brent, Algorithms for Minimization Without Derivatives. Courier Corporation, 2013.
-  P. Baudiš and P. Pošík, “Global Line Search Algorithm Hybridized with Quadratic Interpolation and its Extension to Separable Functions,” in Proceedings of the 17th Annual Conference on Genetic and Evolutionary Computation (GECCO). New York, NY, USA: ACM, 2015, pp. 257 – 264.
-  P. Pošík and P. Baudiš, “Dimension Selection in Axis-Parallel Brent-Step Method for Black-Box Optimization of Separable Continuous Functions,” in Proceedings of the 17th Annual Conference on Genetic and Evolutionary Computation (GECCO). New York, NY, USA: ACM, 2015, pp. 1151 – 1158. [Online]. Available: http://pasky.or.cz/sci/dimsel.pdf
-  L. Pál, “Comparison of Multistart Global Optimization Algorithms on the BBOB Noiseless Testbed,” in Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation (GECCO). ACM, July 2013, pp. 1153 – 1160. [Online]. Available: http://coco.gforge.inria.fr/lib/exe/fetch.php?media=pdf2013:w0303-pal.pdf
-  A. H. G. Rinnooy Kan and G. T. Timmer, “Stochastic Global Optimization Methods Part II: Multi Level Methods,” Mathematical Programming, vol. 39, no. 1, pp. 57–78, 1987.
-  C. G. Broyden, “The Convergence of a Class of Double-Rank Minimization Algorithms 2. The New Algorithm,” Journal of Applied Mathematics, vol. 6, no. 3, pp. 222 – 231, 1970.
-  W. Huyer and A. Neumaier, “Benchmarking of MCS on the Noiseless Function Testbed,” in Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation (GECCO). ACM, July 2009. [Online]. Available: http://www.mat.univie.ac.at/%7Eneum/ms/mcs_exact.pdf
-  N. Hansen and A. Ostermeier, “Completely Derandomized Self-Adaptation in Evolution Strategies,” Evolutionary Computation Journal, vol. 9, no. 2, pp. 159 – 195, 2001.
-  A. Atamna, “Benchmarking IPOP-CMA-ES-TPA and IPOP-CMA-ES-MSR on the BBOB Noiseless Testbed,” in Proceedings of the 17th Annual Conference on Genetic and Evolutionary Computation (GECCO). New York, NY, USA: ACM, 2015, pp. 1135 – 1142. [Online]. Available: https://www.lri.fr/%7Eatamna/atamna_BBOB_15.pdf
-  S. van Rijn, H. Wang, M. van Leeuwen, and T. Bäck, “Evolving the Structure of Evolution Strategies,” in Proceedings of the IEEE Symposium Series on Computational Intelligence (SSCI). IEEE, December 2016.
-  A. Auger, D. Brockhoff, and N. Hansen, “Benchmarking the Local Metamodel CMA-ES on the Noiseless BBOB’2013 Test Bed,” in Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation (GECCO). ACM, July 2013, pp. 1225 – 1232. [Online]. Available: http://coco.gforge.inria.fr/lib/exe/fetch.php?media=pdf2013:w0313-auger.pdf
-  A. Auger and N. Hansen, “A Restart CMA Evolution Strategy with Increasing Population Size,” in Proceedings of the IEEE Congress on Evolutionary Computation (CEC), vol. 2. IEEE, September 2005, pp. 1769 – 1776.
-  I. Loshchilov, M. Schoenauer, and M. Sebag, “Bi-Population CMA-ES Algorithms with Surrogate Models and Line Searches,” in Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation (GECCO). ACM, July 2013, pp. 1177 – 1184. [Online]. Available: http://coco.gforge.inria.fr/lib/exe/fetch.php?media=pdf2013:w0306-loshchilov.pdf
-  N. Hansen, “Benchmarking a BI-Population CMA-ES on the BBOB-2009 Function Testbed,” in Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation (GECCO) Companion: Late Breaking Papers. ACM, July 2009, pp. 2389 – 2396.
-  I. Loshchilov, M. Schoenauer, and M. Sebag, “Intensive Surrogate Model Exploitation in Self-adaptive Surrogate-assisted CMA-ES (saACM-ES),” in Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation (GECCO). ACM, July 2013, pp. 439 – 446.
-  M. Powell, “The NEWUOA Software for Unconstrained Optimization Without Derivatives,” Large-scale Nonlinear Optimization, pp. 255 – 297, 2006.
-  F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Sequential Model-Based Optimization for General Algorithm Configuration,” in Proceedings of 5th International Conference on Learning and Intelligent Optimization (LION), C. A. Coello Coello, Ed. Springer, January 2011, pp. 507 – 523.
-  F. Hutter, H. Hoos, and K. Leyton-Brown, “An Evaluation of Sequential Model-Based Optimization for Expensive Blackbox Functions,” in Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation (GECCO). ACM, July 2013, pp. 1209 – 1216. [Online]. Available: http://coco.gforge.inria.fr/lib/exe/fetch.php?media=pdf2013:w0311-hutter.pdf
-  C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning. MIT press Cambridge, 2006, vol. 1.
-  D. R. Jones, C. D. Perttunen, and B. E. Stuckman, “Lipschitzian Optimization without the Lipschitz Constant,” Journal of Optimization Theory and Applications, vol. 79, no. 1, pp. 157 – 181, 1993.
-  Z. Ugray, L. Lasdon, J. Plummer, F. Glover, J. Kelly, and R. Martí, “Scatter Search and Local NLP Solvers: A Multistart Framework for Global Optimization,” INFORMS Journal on Computing, vol. 19, no. 3, pp. 328 – 340, 2007.
-  M. Laguna and R. Martí, “The OptQuest Callable Library,” in Optimization Software Class Libraries. Springer, 2003, pp. 193 – 218.
-  B. Bischl, M. Lang, L. Kotthoff, J. Schiffner, J. Richter, Z. M. Jones, and G. Casalicchio, mlr: Machine Learning in R, 2016, R-package version 2.9. [Online]. Available: https://github.com/mlr-org/mlr
-  T. Therneau, B. Atkinson, and B. Ripley, rpart: Recursive Partitioning and Regression Trees, 2017, r package version 4.1-11. [Online]. Available: https://CRAN.R-project.org/package=rpart
-  A. Karatzoglou, A. Smola, K. Hornik, and A. Zeileis, “kernlab – An S4 Package for Kernel Methods in R,” Journal of Statistical Software, vol. 11, no. 9, pp. 1 – 20, 2004. [Online]. Available: http://www.jstatsoft.org/v11/i09/
-  A. Liaw and M. Wiener, “Classification and Regression by randomForest,” R News, vol. 2, no. 3, pp. 18 – 22, 2002. [Online]. Available: http://CRAN.R-project.org/doc/Rnews/
-  T. Chen, T. He, M. Benesty, V. Khotilovich, and Y. Tang, xgboost: Extreme Gradient Boosting, 2017, R-package version 0.6-4. [Online]. Available: https://CRAN.R-project.org/package=xgboost
-  S original by Trevor Hastie and Robert Tibshirani. Original R port by Friedrich Leisch and Kurt Hornik and Brian D. Ripley., mda: Mixture and Flexible Discriminant Analysis, 2016, R-package version 0.4-9. [Online]. Available: https://CRAN.R-project.org/package=mda
-  B. Bischl, P. Kerschke, L. Kotthoff, T. M. Lindauer, Y. Malitsky, A. Fréchette, H. H. Hoos, F. Hutter, K. Leyton-Brown, K. Tierney, and J. Vanschoren, “ASlib: A Benchmark Library for Algorithm Selection,” Artificial Intelligence Journal, vol. 237, pp. 41 – 58, 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0004370216300388