I Introduction
Although the Algorithm Selection Problem (ASP, [1]) was introduced more than four decades ago, only few works (e.g., [2, 3]) perform algorithm selection in the field of continuous optimization. Independent of the underlying domain, the goal of the ASP can be described as follows: given a set of optimization algorithms A (often denoted as algorithm portfolio) and a set of problem instances I, one wants to find a model that selects the best algorithm from the portfolio for an unseen problem instance. Although there already exists a plethora of optimization algorithms – even when only considering single-objective, continuous optimization problems – none of them can be considered superior to all others across all optimization problems. Hence, it is very desirable to find a sophisticated selection mechanism, which automatically picks the portfolio's best solver for a given problem.
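To make the selection task concrete, the following toy sketch (all names, features and runtimes are hypothetical, not taken from this work) shows a selector as a simple mapping from instance features to the portfolio member with the best predicted performance:

```python
# Illustrative sketch (not the paper's code): algorithm selection as a
# mapping from instance features to the presumably best portfolio solver.

def select_algorithm(features, model, portfolio):
    """Pick the solver with the best (lowest) predicted performance value."""
    predictions = {a: model(features, a) for a in portfolio}
    return min(predictions, key=predictions.get)

# Toy model: predicted runtime depends on a single "multimodality" feature.
def toy_model(features, algorithm):
    if algorithm == "local_search":
        return 10 + 1000 * features["multimodality"]   # fast on unimodal problems
    return 100                                         # restart strategy: constant cost

portfolio = ["local_search", "restart_strategy"]
print(select_algorithm({"multimodality": 0.01}, toy_model, portfolio))  # local_search
print(select_algorithm({"multimodality": 0.9}, toy_model, portfolio))   # restart_strategy
```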
Within other optimization domains, such as the well-known Travelling Salesperson Problem, feature-based algorithm selectors have already shown their capability of outperforming the respective state-of-the-art optimization algorithm(s) by combining machine learning techniques and problem-dependent features [4, 5]. As schematized in Figure 1, we now transfer the respective idea of using instance-specific features to single-objective continuous optimization problems based on Exploratory Landscape Analysis (ELA, [6]), leveraging the solver complementarity of a well-designed algorithm portfolio.
As we show within this work, the integration of ELA results in strong performance improvements (averaged across the entire benchmark) over any of the portfolio's single solvers. More precisely, our selector requires only half of the resources needed by the portfolio's single best solver. Hence, our model strongly reduces the gap towards the idealistic – and thus, from a realistic point of view, unreachable – virtual best solver.
A more detailed overview of Exploratory Landscape Analysis, as well as an introduction to flacco – an R toolbox for computing a variety of such landscape features, enhanced by a graphical user interface – is given in Section II. In Section III, we give more insights into the COCO platform, and afterwards describe our experimental setup, including the generation of the considered algorithm portfolio, in Section IV. An analysis of the resulting algorithm selection models is given in Section V, and Section VI concludes our work.
II Exploratory Landscape Analysis
While problem-dependent (landscape) features can in general be computed for any optimization problem (e.g., [7, 8, 9, 10, 11]), we will only consider single-objective, continuous optimization problems within this work. For this domain, Mersmann et al. introduced a sophisticated approach for characterizing a problem's landscape by means of numerical feature values and called it "Exploratory Landscape Analysis" [6]. Within their work, they designed a total of 50 numerical measures and grouped them into six categories of so-called "low-level" features: convexity, curvature, level set, local search, meta-model and y-distribution features. These numbers were then used to characterize eight so-called "high-level" properties, such as the degree of multimodality, the separability, the basin size homogeneity or the number of plateaus. However, given that these properties (a) require expert knowledge and consequently cannot be computed automatically, and (b) are categorical and thus make it impossible to, e.g., distinguish problems by their degree of multimodality, the introduction of the low-level features can be seen as a major step towards automatically computable landscape features and hence automated algorithm selection.
II-A Background
Already in the years before the term ELA was introduced, researchers tried to characterize a problem's landscape by numerical values: Jones and Forrest assessed a problem's difficulty by means of a fitness distance correlation [12], Lunacek and Whitley introduced a dispersion metric [13], Malan and Engelbrecht quantified the landscape's ruggedness [14], and Müller and Sbalzarini performed fitness-distance analyses [15].
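As an illustration of this line of work, the fitness distance correlation of Jones and Forrest [12] can be sketched in a few lines (a simplified Python sketch; the original formulation samples candidate solutions and measures their distances to the nearest global optimum):

```python
# Hedged sketch of the fitness distance correlation (FDC) idea [12]:
# Pearson correlation between sampled fitness values and the corresponding
# distances to the (known) optimum.
import math

def fdc(fitness, dist):
    """Pearson correlation between fitness values and distances to the optimum."""
    n = len(fitness)
    mf, md = sum(fitness) / n, sum(dist) / n
    cov = sum((f - mf) * (d - md) for f, d in zip(fitness, dist)) / n
    sf = math.sqrt(sum((f - mf) ** 2 for f in fitness) / n)
    sd = math.sqrt(sum((d - md) ** 2 for d in dist) / n)
    return cov / (sf * sd)

# On a sphere function f(x) = x^2 (optimum at 0), fitness grows with distance,
# so the FDC is close to 1, indicating an easy, unimodal landscape.
xs = [-2.0, -1.0, -0.5, 0.5, 1.5, 2.5]
print(round(fdc([x * x for x in xs], [abs(x) for x in xs]), 3))
```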
However, after Bischl et al. [2] had shown the potential of landscape analysis by using ELA features for algorithm selection, a multitude of new landscape features was introduced. Abell et al. used hill climbing characteristics [16], Muñoz et al. measured a landscape's information content [17], and Morgan and Gallagher analyzed problems with the help of length scale features [18]. More recently, Malan et al. characterized constrained optimization problems [19], and Shirakawa and Nagao introduced an entire bag of local landscape features [20].
In other works, research groups, which also include this paper's authors, have designed features based on a discretization of the continuous search space into a grid of cells [21], and successfully employed the nearest better clustering approach for distinguishing funnel-shaped landscapes with a global structure from problems whose local optima are aligned in a more random manner [22]. This particularly facilitates the decision of which class of algorithms is most suited for the problem at hand. We also showed that even a low budget of observations (scaling with the problem dimensionality d) – i.e., a sample size that is close to the size of an evolutionary algorithm's initial population – is sufficient for such a distinction [23]. In consequence, the evaluation of this initial sample, which is required for the landscape features, would come without any additional costs, given that the evolutionary algorithm would have to evaluate those points anyway.
II-B Flacco
In the previous subsection, we have provided an overview of numerous landscape features. Unfortunately, those features were usually – if at all – implemented in different programming languages, such as Python [24], Matlab [25] or R [26], making it extremely complicated to use all of them within a single experiment. This obstacle has been removed (for R-users) with the development of flacco [27], an R-package for feature-based landscape analysis of continuous and constrained optimization problems. The package (currently) provides a collection of more than 300 landscape features (including the majority of the ones from above), distributed across a total of 17 different feature sets. In addition, the package comes with several visualization techniques, which help to facilitate the understanding of the underlying problem landscapes [28]. One can either use the package's stable release from CRAN (https://cran.r-project.org/package=flacco) or its developmental version from GitHub (https://github.com/kerschke/flacco). Note that the latter also provides further information on the usage of flacco, as well as a link to its online tutorial (http://kerschke.github.io/flacco-tutorial/site/).
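flacco itself is an R-package; purely as a language-neutral illustration of what such a landscape feature computes, here is a minimal Python sketch of the dispersion metric [13] mentioned above – the mean pairwise distance among the best sample points relative to that of the full sample (the test function and sample are toy choices):

```python
# Simplified sketch of a dispersion-style feature (not flacco's exact
# implementation): compare the spread of the best sample points with the
# spread of the whole sample.
import itertools, math, random

def mean_pairwise_dist(points):
    pairs = list(itertools.combinations(points, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def dispersion(sample, f, quantile=0.1):
    """Ratio of the mean pairwise distance of the best points vs. all points."""
    best = sorted(sample, key=f)[: max(2, int(len(sample) * quantile))]
    return mean_pairwise_dist(best) / mean_pairwise_dist(sample)

random.seed(1)
sample = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(200)]
sphere = lambda x: x[0] ** 2 + x[1] ** 2
# On a unimodal sphere, the best points cluster around the optimum, so the
# dispersion is clearly below 1.
print(dispersion(sample, sphere) < 1.0)
```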
Being aware of possible obstacles for non-R-users, we recently developed a web-hosted and hence platform-independent graphical user interface (GUI) of flacco (http://www.flacco.shinyapps.io/flacco) [29]. A screenshot of that GUI is displayed in Figure 2. The GUI provides a slim, and thus user-friendly, version of flacco. In its left part (highlighted in grey), the user needs to provide information on the function that should be optimized – either by selecting one of the single-objective optimization problems available in the R-package smoof [30], configuring one of the BBOB functions, manually defining a function, or by uploading an evaluated set of points. On the right side, one can then either compute a specific feature set or visualize certain aspects of the optimization problem and/or its features.
III Exploratory Data Analysis of COCO
Instead of executing the optimization algorithms ourselves, we use the results from COCO (http://coco.gforge.inria.fr/) [31], which is a platform for COmparing Continuous Optimization algorithms on the Black-Box Optimization Benchmark (BBOB, [32, 33]). The platform provides a collection of the performance results of 129 optimization algorithms, which have been submitted to the BBOB workshops at the GECCO conference (http://sig.sigevo.org/index.html/tiki-index.php?page=GECCOs) of the years 2009, 2010, 2012, 2013 and 2015. Using the results published on this platform comes with the advantage that, over the years, basically all important classes of optimization algorithms, including current state-of-the-art optimizers, have been submitted. Thus, the experiments are of representative character in a comprehensive setting.
III-A General Setup of the COCO Platform
The competition settings were quite similar across the years: per dimension d and function f, the participants had to submit the results for a total of 15 problem instances. As shown below, only five instances (IIDs 1 to 5) were part of each competition, while the remaining ten instances changed per year:

- 2009: IIDs 1 to 5 (with 3 replications each)
- 2010: IIDs 1 to 15 (1 replication each)
- 2012: IIDs 1 to 5, 21 to 30 (1 replication each)
- 2013: IIDs 1 to 5, 31 to 40 (1 replication each)
- 2015: IIDs 1 to 5, 41 to 50 (1 replication each)
For each pair of problem instance and optimization algorithm, the submitted data contains a log of the performed number of function evaluations and the corresponding achieved fitness value, enabling an a posteriori evaluation of the solver's performance. More precisely, this data allows to decide whether the solver successfully approximated the (known) global optimum of the instance up to a precision value ε – i.e., whether it found a solution whose fitness value lies within the ε-neighborhood of the optimal value – and also how many function evaluations were performed until the solver (un)successfully terminated. These values were then used to compute the solver's Expected Runtime (ERT) [32], i.e., the total number of function evaluations summed across all runs on a problem, divided by the number of successful runs:

ERT = (sum of function evaluations across all runs) / (number of successful runs)
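The ERT thus charges the evaluations of unsuccessful runs to the successful ones. A minimal sketch, assuming a simple list-of-runs data layout:

```python
# Sketch of the ERT computation described above (assumed interface): each run
# contributes its number of function evaluations; only successful runs count
# in the denominator.
def expected_runtime(runs):
    """runs: list of (function_evaluations, success) tuples for one solver/problem."""
    n_success = sum(1 for _, ok in runs if ok)
    if n_success == 0:
        return float("inf")  # ERT undefined; such values are later imputed
    return sum(fe for fe, _ in runs) / n_success

# Three runs, two successful: ERT = (500 + 1200 + 300) / 2 = 1000.
print(expected_runtime([(500, True), (1200, False), (300, True)]))  # 1000.0
```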
Although the aforementioned setup comes with some drawbacks, we will still use it, as it is well-established in the community and thus allows for wide comparability. Nevertheless, we would like to address its pitfalls as well: (a) Given the strong variability of the instances' objective values, it is not straightforward to use a fixed absolute precision value for comparing the solvers across all instances. Instead, it might be more reasonable to use a relative threshold. (b) Computing the ERT based on different instances – even if they belong to the same BBOB problem – is very questionable, as the instances can have completely different landscapes: if the original instance is a highly multimodal function, the transformed instance (per definition, instances are identical up to rotation, scaling and shifts [33]) could – if at all – consist of only a few peaks. Therefore, we strongly encourage future setups to require several runs of each solver on the same instance, i.e., to perform (at least five to ten) replicated runs, and to then evaluate the performance results per instance rather than per function, allowing for ERT computations on function and instance level.
III-B Preprocessing COCO's Solver Performances
We use the performance results of the 129 optimization algorithms available at COCO (an overview of all submitted optimization algorithms, along with their descriptions, can be found at http://coco.gforge.inria.fr/doku.php?id=algorithms). However, in order to work with a valid database, we first performed some sanity checks on the platform's data before combining it into a joint database.
While the submitted results of all 29 (2009) and 23 (2010) solvers of the first two competitions passed these checks, in the following years only 22 of 28 (2012), 16 of 31 (2013) and 13 of 18 (2015) submissions did so; the invalid submissions partially used the settings of previous years. In order to use the most general portfolio of solvers, we only considered the submissions for IIDs 1 to 5 with one run per instance, as this is the only set of instances that was used across all five BBOB competitions. Fortunately, the performances of all 129 solvers could be considered for our experiments, because even the problematic submissions had valid data for the first five instances.
IV Experimental Setup
IV-A Algorithm Performance Data
For each of the 129 solvers, we computed the ERT per combination of problem dimension, BBOB problem (respectively function) and problem instance (if multiple runs existed, we only considered the first run), resulting in a total of 61 920 observations (129 solvers × 4 dimensions × 24 functions × 5 instances). The ERTs were computed for a fixed precision threshold, because smaller values led to too many unsuccessful runs. Even for the chosen precision, a noticeable share of all (considered) runs did not terminate successfully.
IV-B Instance Feature Data
Each of the ELA feature sets was computed using a so-called improved latin hypercube design [34], whose observations were sampled across the decision space [-5, 5]^d. The feature sets were then computed using the R-package flacco [27] for all four problem dimensions, 24 BBOB problems and five problem instances that were used by the performance data (see Section IV-A). For each of these 480 observations, we calculated the six classical ELA feature sets (convexity, curvature, level set, local search, meta-model and y-distribution) [6], as well as the basic, (cell mapping) angle (computed using three blocks per dimension, as larger values would result in too many empty cells due to the "curse of dimensionality") [21], dispersion [13], information content [17], nearest better clustering [22] and principal component features, resulting in a total of 102 features per problem instance. Although being conscious of the resulting information loss, we aggregated each feature across the five problem instances (per BBOB problem) via the median of the respective feature values, in order to map our feature data to the 96 observations (24 problems, four dimensions) of the performance data.
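The aggregation step can be sketched as follows (the feature names and values are made up for illustration; flacco's actual feature names differ):

```python
# Sketch of the per-problem aggregation (hypothetical data layout): each of
# the 102 features is reduced to the median of its values across the five
# problem instances of one BBOB problem and dimension.
from statistics import median

def aggregate_features(instance_features):
    """instance_features: list of dicts (one per instance) mapping feature -> value."""
    keys = instance_features[0].keys()
    return {k: median(d[k] for d in instance_features) for k in keys}

# Five instances of one BBOB problem, two illustrative features.
instances = [
    {"disp.ratio": 0.31, "ic.h_max": 0.72},
    {"disp.ratio": 0.35, "ic.h_max": 0.70},
    {"disp.ratio": 0.29, "ic.h_max": 0.75},
    {"disp.ratio": 0.40, "ic.h_max": 0.69},
    {"disp.ratio": 0.33, "ic.h_max": 0.74},
]
print(aggregate_features(instances))  # medians: 0.33 and 0.72
```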
IV-C Constructing the Algorithm Portfolio
For meaningfully addressing the algorithm selection task, the choice of the underlying algorithm portfolio is crucial. Ideally, the considered set should be as small and as complementary as possible, and should include state-of-the-art optimizers. For this purpose, we ranked the solvers per considered BBOB problem based on their ERT performance. We then constructed four solver sets (one per dimension), each of them containing the solvers that ranked within the "Top 3" of at least one of the 24 functions of the respective dimension. Based on these four solver sets – consisting of between 37 and 41 solvers each – a portfolio of 12 solvers was constructed by only considering optimizers that belonged to each of the four sets.
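The construction scheme can be sketched with toy data as follows (solver names and ERT values are hypothetical):

```python
# Sketch of the portfolio construction: per dimension, keep every solver that
# ranks among the Top 3 of at least one function, then intersect these sets
# across all dimensions.
def build_portfolio(ert, dims, functions, solvers, k=3):
    """ert[(solver, dim, fid)] -> expected runtime (lower is better)."""
    per_dim = []
    for d in dims:
        keep = set()
        for fid in functions:
            ranked = sorted(solvers, key=lambda s: ert[(s, d, fid)])
            keep.update(ranked[:k])
        per_dim.append(keep)
    return set.intersection(*per_dim)

solvers = ["A", "B", "C", "D"]
ert = {}
for d in (2, 3):
    for fid in (1, 2):
        for i, s in enumerate(solvers):
            ert[(s, d, fid)] = float(i + 1)          # A best everywhere ...
ert[("D", 2, 2)] = 0.5                               # ... except D excels once in 2D

# D only makes the Top 3 in 2D, so it drops out of the intersection.
print(sorted(build_portfolio(ert, (2, 3), (1, 2), solvers)))  # ['A', 'B', 'C']
```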
The 12 optimization algorithms of the resulting portfolio can be grouped into four categories, which are summarized below.
IV-C1 Deterministic Optimization Algorithms (2)
The two solvers of this category are variants of the Brent-STEP algorithm [37], which accelerates the global line search method STEP ("select the easiest point") [35] by using Brent's method [36]. It performs axis-parallel searches and chooses the next iteration's search dimension either using a round-robin (BSrr, [38]) or a quadratic interpolation strategy (BSqi, [38]).
IV-C2 Multi-Level Approaches (5)
The origin of most solvers belonging to this category is the multi-level single-linkage method (MLSL, [39, 40]). It is a stochastic, multi-start, global optimizer that relies on random sampling and local searches. Aside from MLSL itself, some of its variants also belong to our portfolio: an interior-point version for constrained nonlinear problems (fmincon, [39]), a quasi-Newton version, which approximates the Hessian using BFGS [41] (fminunc, [39]), and a hybrid variant whose most important improvements are related to its sampling phase (HMLSL, [39]). The final optimizer belonging to this group is the multi-level coordinate search (MCS, [42]), which splits the search space into smaller boxes – each containing a known observation – and then starts local searches from promising boxes.
IV-C3 Variants of the CMA-ES (4)
In 2001, Hansen introduced one of the most popular evolution strategies: the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [43] with cumulative step-size adaptation (CMA-CSA, [44]). It led to a plethora of variants [45], including the following three solvers from our portfolio: (1) IPOP400D [46], a restart version of the CMA-ES with an increasing population size (IPOP-CMA-ES, [47]) and a limited budget of function evaluations. (2) A hybrid CMA (HCMA, [48]), which combines a bi-population self-adaptive surrogate-assisted CMA-ES (BIPOP-aACM-ES-k, [50]) – a BIPOP-CMA-ES [49] is a multi-start CMA-ES with equal budgets for two interlaced restart strategies, one with an increasing population size and one with varying small population sizes – with STEP [35] and NEWUOA [51] to benefit from surrogate models and line searches simultaneously. (3) A sequential model-based algorithm configuration (SMAC, [52]) procedure applied to the BBOB problems (SMAC-BBOB, [53]). It uses Gaussian processes (GP, [54]) to model the expected improvement function, and then performs one run of DIRECT [55] and ten runs of the classical CMA-ES [43] on the expected improvement function.
IV-C4 Others (1)
The final optimizer from our portfolio is called OptQuest/NLP (OQNLP, [39, 56]). It is a commercial, heuristic, multi-start algorithm that was designed to find the global optima of smooth constrained nonlinear programs (NLPs) and mixed integer nonlinear programs (MINLPs). The algorithm uses the OptQuest Callable Library (OCL, [57]) to generate candidate starting points for a local NLP solver.
IV-D Machine Learning Algorithms
We considered three classes of supervised learning strategies for training our algorithm selection models: (1) A classification approach, which simply tries to predict the best-performing optimizer (ties are broken via random uniform sampling among the tied solvers) and hence completely ignores the magnitude of the performance differences between the best and the remaining portfolio solvers. (2) A regression approach, which trains a separate model for the performances of each optimizer and afterwards predicts the solver with the best predicted performance. (3) In addition to these well-known strategies, we also considered the so-called pairwise regression, which led to promising results in other works (e.g., [4, 5]). In contrast to modelling the performances directly (as in (2)), it models the performance differences for each pair of solvers and afterwards predicts the solver whose predicted performance difference to all other solvers is highest.

The algorithm selectors were trained in R [26] using the R-package mlr [58]. For each class of the supervised learning approaches, we considered recursive partitioning and regression trees (rpart, [59]), kernel-based support vector machines (ksvm, [60]), random forests (randomForest, [61]) and extreme gradient boosting (xgboost, [62]). Additionally, we also tried multivariate adaptive regression splines (mars, [63]) in case of the (paired) regression approaches. Note that the SVM's inverse kernel width sigma was the only hyperparameter that was (automatically) configured, using the sigest function from the R-package kernlab [60]. All other hyperparameters were used in their default settings: the SVMs used Gaussian radial basis kernels, and the random forests were constructed using 500 trees, whose split points were sampled uniformly at random from a subset of the features (by default, the square root of the number of features for classification, and a third of them for regression / paired regression).
IV-E Feature Selection Strategies
The algorithm selectors are initially trained using all 102 features. However, using all features simultaneously likely introduces noise and/or redundancy, which could lead to poorly performing algorithm selectors. Furthermore, some of the feature sets – namely the convexity, curvature and local search features from the classical ELA features [6] – require additional function evaluations on top of the costs for the initial design. In order to overcome these obstacles, we used the following four feature selection strategies.
IV-E1 sffs
A greedy forwardbackward selection, which starts with an empty feature set and iteratively alternates between greedily adding and/or removing features as long as this results in an improvement of the model’s performance.
IV-E2 sfbs
This strategy works analogously to the previous one, but starts with the full set of all 102 features and afterwards alternates between removing and adding features.
IV-E3 GA (10 + 5)
A genetic algorithm with population size 10 and 5 offspring per generation. Here, the selected features are represented as a 102-dimensional bit string, where a value of 1 at position i implies that the i-th feature is selected. The GA runs for a maximum of 100 generations and selects the features by performing random bit flips, using the default mutation and crossover rates.
IV-E4 GA (10 + 50)
A modified version of the previous GA, using tenfold the number of offspring (50 instead of 5) per generation.
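As an illustration of the greedy strategies, a pure forward-selection step (the actual sffs strategy additionally interleaves backward removal steps) might look like the sketch below; `evaluate` is a placeholder for the cross-validated selector performance, where a lower relERT is better, and the feature names are made up:

```python
# Sketch of greedy forward selection: repeatedly add the single feature that
# improves the (placeholder) performance measure the most, until no addition helps.
def forward_selection(features, evaluate):
    selected, best = [], float("inf")
    improved = True
    while improved:
        improved = False
        for f in features:
            if f in selected:
                continue
            score = evaluate(selected + [f])
            if score < best:
                best, candidate, improved = score, f, True
        if improved:
            selected.append(candidate)
    return selected, best

# Toy objective: only two of the features actually help; each extra feature
# adds a small penalty, so noise features are never selected.
useful = {"ela_meta.lin_simple.adj_r2", "disp.ratio"}
def evaluate(feats):
    return 30.0 - 10.0 * len(useful & set(feats)) + 0.1 * len(feats)

feats = ["ela_meta.lin_simple.adj_r2", "disp.ratio", "noise_1", "noise_2"]
print(forward_selection(feats, evaluate))
```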
IV-F Performance Assessment
In lieu of using the ERT itself, we use the relative ERT (relERT) for our experiments. While the former strongly biases the algorithm selectors towards multimodal and higher-dimensional problems, due to the much larger amounts of function evaluations used there, the relERTs – which are also used within the BBOB competitions – allow a fair comparison of the solvers across problems and dimensions by normalizing each solver's ERT with the ERT of the best solver for the respective BBOB problem (of that dimension). Instead of scaling each performance with the respective best ERT of all 129 solvers, we used the best performance among the 12 solvers of the considered portfolio, as this is our set of interest in this study.
As some solvers did not solve even a single instance of some BBOB functions, the respective ERTs and relERTs were not defined. We thus imputed the missing relERTs using the well-known PAR10 score (e.g., [64]). That is, each of the problematic values is replaced with a penalty value that is the tenfold of the highest valid relative ERT; for all other values, the respective (finite) relERTs are used.

For each of the supervised learning approaches (see Section IV-D), the algorithm selectors are evaluated using the mean relative ERT, which averages the relERTs of the predictions (including the costs for the initial design) on the corresponding test data, i.e., a subset of the BBOB problems. In order to obtain more realistic and reliable estimates of the selectors' performances, they were assessed using leave-one-(function)-out cross-validation. That is, per algorithm selector, we train a total of 96 submodels, each of which uses only 95 of the 96 BBOB problems for training and the remaining one for testing. Note that each problem was used exactly once as test data. Consequently, within each fold of the cross-validation, exactly one problem (the respective test data) is completely kept out of the modeling phase and only used for assessing the respective submodel's performance. The average of the resulting 96 relERTs is then used as the algorithm selector's performance.
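The relERT computation with PAR10 imputation can be sketched as follows (toy ERT values; the solver names merely reuse two portfolio members for illustration):

```python
# Sketch of relERT + PAR10 imputation (assumed data layout): ERTs are
# normalized by the best portfolio ERT per problem; undefined values (inf)
# are replaced by ten times the highest finite relERT.
def relative_erts(ert_per_problem):
    """ert_per_problem: list of dicts mapping solver -> ERT (inf if unsolved)."""
    rel = [{s: e / min(d.values()) for s, e in d.items()} for d in ert_per_problem]
    finite = [v for d in rel for v in d.values() if v != float("inf")]
    penalty = 10 * max(finite)
    return [{s: (penalty if v == float("inf") else v) for s, v in d.items()} for d in rel]

problems = [
    {"HCMA": 100.0, "MLSL": 400.0},          # relERTs: 1.0 and 4.0
    {"HCMA": float("inf"), "MLSL": 250.0},   # HCMA unsolved -> imputed
]
rel = relative_erts(problems)
print(rel[1]["HCMA"])  # PAR10 penalty: 10 * 4.0 = 40.0
```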
Following common practice in algorithm selection (e.g., [64]), we compare the performances of our algorithm selectors with two baselines: the virtual best solver (VBS) and the single best solver (SBS). The virtual best solver, sometimes also called oracle or perfect selector, provides a lower bound for the selectors, as it shows the performance that one could (theoretically) achieve on the data when always selecting the best-performing algorithm per instance. Given that the relERT has a lower bound of 1, and that at least one solver achieves that perfect performance per instance, the VBS has to be 1 by definition. Nevertheless, algorithm selectors usually do not reach such an idealistic performance, due to their usually imperfect selections and/or the costs for computing the landscape features.
The second baseline, the SBS, represents the (aggregated) performance of the (single) best solver from the portfolio. Consequently, this baseline is much more important than the VBS, because an algorithm selector is only useful if its performance (including feature costs) is better than the SBS. Note that in principle, the SBS could either be determined on the entire data set or, for (leaveoneout) crossvalidation, be computed per fold. However, within our experiments, both approaches result in identical performances.
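Given per-problem relERT values, the two baselines can be computed as follows (toy data; a sketch, not the paper's code):

```python
# Sketch of the VBS and SBS baselines: the VBS picks the best solver per
# problem, the SBS is the single solver with the best average performance.
def vbs_performance(rel):
    """Mean relERT of the virtual best solver (best solver per problem)."""
    return sum(min(d.values()) for d in rel) / len(rel)

def single_best_solver(rel):
    """Solver with the best mean relERT across all problems."""
    solvers = rel[0].keys()
    mean = {s: sum(d[s] for d in rel) / len(rel) for s in solvers}
    return min(mean, key=mean.get)

# Toy relERTs: per problem, a dict mapping solver -> relative ERT.
rel = [
    {"HCMA": 1.0, "MLSL": 6.0},
    {"HCMA": 4.0, "MLSL": 1.0},
    {"HCMA": 1.0, "MLSL": 8.0},
]
print(vbs_performance(rel))      # 1.0: some solver has relERT 1 on every problem
print(single_best_solver(rel))   # HCMA (mean 2.0 vs. 5.0)
```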
V Results
V-A Analyzing the Algorithm Portfolio
TABLE I: Relative Expected Runtime of the 12 Solvers from the Considered Algorithm Portfolio and the 2 Presented Algorithm Selection Models
Dim  BBOB Group  BSqi  BSrr  CMA-CSA  fmincon  fminunc  HCMA  HMLSL  IPOP400D  MCS  MLSL  OQNLP  SMAC-BBOB  AS-Model #1  AS-Model #2
2  F1  F5  1.2  1.3  54.8  11.0  11.8  3.7  14.6  18.4  5.8  15.5  17.0  22 014.9  16.6  20.3 
F6  F9  18 516.7  9 708.2  7.4  18.6  19.2  5.8  1.7  5.7  11.3  24.2  1.5  27 518.6  3.1  3.5  
F10  F14  7 649.2  7 481.5  8.3  1.0  62.7  6.3  1.0  10.7  322.7  1.0  4.9  29 353.2  4.7  4.0  
F15  F19  7 406.6  14 710.3  14.7  7 392.0  7 367.7  25.3  8.1  15.5  7.7  7 391.7  7 351.2  29 354.8  26.2  10.1  
F20  F24  84.8  14 768.5  7 351.9  4.1  14.5  44.9  3.9  14 679.3  11.4  2.1  2.7  22 014.6  42.5  3.0  
all  6 240.7  9 318.4  1 549.1  1 546.5  1 556.7  17.7  6.0  3 068.4  74.3  1 547.9  1 536.9  25 990.1  19.3  8.4  
3  F1  F5  1.3  1.3  7 367.9  85.2  132.1  356.1  6.8  14 686.6  45.9  55.9  7 347.6  22 015.1  58.4  94.9 
F6  F9  331.2  9 527.4  4.7  38.5  9 173.7  4.5  1.9  6.5  31.4  9 173.4  2.5  36 690.3  3.3  39.9  
F10  F14  29 356.3  14 712.1  8.9  1.0  4.1  5.0  1.0  12.3  8 132.7  1.0  9.3  29 353.4  4.8  3.6  
F15  F19  14 698.2  22 026.2  1.6  14 701.2  14 699.5  2.6  11.4  7 339.4  7 346.9  14 700.0  14 686.2  36 690.3  2.8  7.1  
F20  F24  14 741.8  14 758.7  7 389.4  7 339.6  14 677.4  66.8  2.3  22 015.1  7 342.4  7 339.8  1.9  22 014.8  67.0  3.4  
all  12 304.7  12 316.7  3 077.4  4 616.2  7 677.5  90.4  4.8  9 178.9  4 769.4  6 132.4  4 593.1  29 047.1  28.3  29.4  
5  F1  F5  1.4  1.4  7 533.6  14 678.4  14 679.2  12.0  17.5  14 688.7  14 678.1  14 678.5  14 678.0  22 015.1  22.7  22.9 
F6  F9  27 597.4  36 690.3  5.6  9 173.5  9 173.8  3.9  2.4  4.9  28.8  9 173.4  9 173.5  36 690.3  4.8  4.8  
F10  F14  22 032.8  29 360.3  8.9  1.0  11.9  4.2  1.0  13.6  22 019.2  1.0  10.7  36 690.3  5.2  5.2  
F15  F19  36 690.3  36 690.3  3.1  36 690.3  36 690.3  4.3  7 346.1  29 352.5  36 690.3  36 690.3  29 352.5  36 690.3  4.4  4.4  
F20  F24  22 053.6  22 050.8  7 400.0  14 678.9  22 014.9  7.7  7 339.8  22 017.4  14 681.0  22 015.0  14 676.8  22 014.9  7.8  7.8  
all  21 428.3  24 469.8  3 114.6  15 289.0  16 819.9  6.5  3 063.8  13 765.8  18 352.4  16 817.4  13 761.8  30 575.6  9.1  9.2  
10  F1  F5  1.6  1.6  14 691.0  14 679.9  14 682.7  2.7  7 365.5  14 698.8  14 680.0  14 679.9  14 678.3  22 015.7  16.3  16.3 
F6  F9  36 690.3  27 563.9  4.3  9 173.4  9 173.8  2.2  4.1  9 181.9  9 188.1  9 173.4  9 173.9  36 690.3  2.7  2.7  
F10  F14  29 359.3  29 359.8  8.4  1.1  15.4  2.8  1.1  7 352.5  22 018.7  1.1  12.0  36 690.3  3.7  3.7  
F15  F19  36 690.3  36 690.3  1.7  36 690.3  36 690.3  2.0  22 028.5  29 352.5  36 690.3  36 690.3  36 690.3  36 690.3  2.1  2.1  
F20  F24  36 690.3  29 367.0  14 685.9  22 015.2  22 015.0  23.6  14 677.1  29 352.8  22 018.9  22 014.6  22 014.9  36 690.3  23.7  23.7  
all  27 519.5  24 472.9  6 123.0  16 817.8  16 821.3  6.9  9 182.4  18 354.6  21 408.0  16 817.6  16 819.7  33 633.1  10.0  10.0  
all  F1  F5  1.4  1.4  7 411.8  7 363.6  7 376.5  93.6  1 851.1  11 023.1  7 352.4  7 357.4  9 180.2  22 015.2  28.5  38.6 
F6  F9  20 783.9  20 872.4  5.5  4 601.0  6 885.1  4.1  2.5  2 299.8  2 314.9  6 886.1  4 587.9  34 397.4  3.5  12.7  
F10  F14  22 099.4  20 228.4  8.7  1.0  23.5  4.6  1.0  1 847.3  13 123.3  1.0  9.3  33 021.8  4.6  4.1  
F15  F19  23 871.3  27 529.3  5.2  23 868.5  23 861.9  8.6  7 348.5  16 515.0  20 183.8  23 868.1  22 020.0  34 856.4  8.9  5.9  
F20  F24  18 392.6  20 236.3  9 206.8  11 009.4  14 680.5  35.8  5 505.8  22 016.2  11 013.4  12 842.9  9 174.1  25 683.7  35.3  9.5  
all  16 873.3  17 644.5  3 466.0  9 567.4  10 718.9  30.4  3 064.3  11 091.9  11 151.0  10 328.8  9 177.9  29 811.5  16.7  14.2 
Here, we examine the constructed portfolio to gain more insights into the data provided by COCO. In a first step, we compare the best performances of the portfolio solvers to the best performances of all 129 solvers. For 51 of the 96 considered BBOB problems (24 functions across the four dimensions), the portfolio's VBS is identical to the VBS of all COCO solvers.
The violin plot on the right side of Figure 3 – which can be understood as a boxplot whose shape represents a kernel density estimate of the underlying observations – indicates that the ratio between the best ERTs of the portfolio and of all COCO solvers (per BBOB problem) is usually close to one. More precisely, only ten BBOB problems, depicted by black dots in the boxplot of Figure 3, have a ratio greater than two, and only three of them – the 2D version of Schaffers F7 Function (FID 17, [33]), as well as the 5D and 10D versions of the Lunacek bi-Rastrigin Function (FID 24, [33]) – have ratios worse than three. Consequently, we can conclude that if we find a well-performing algorithm selector, its suggested optimization algorithm likely requires less than twice the number of function evaluations compared to the best algorithm from the entire COCO platform.

As mentioned in Section IV-F, we had to impute some of the performances, because not all optimizers solved at least one instance per BBOB problem. More precisely, HCMA was the only solver to find at least one valid solution for all 96 problems, followed by HMLSL (88), CMA-CSA (87), OQNLP (72), fmincon (71), MLSL (69), fminunc (68), IPOP400D and MCS (67 each), BSqi (52), BSrr (50) and SMAC-BBOB (18). Hence, the mean relERT of HCMA (30.37) was by far the lowest of all considered solvers, making this solver the clear SBS.
However, as shown in Table I, HCMA is not superior to the remaining solvers across all problems. Independent of the problem dimensions, the Brent-STEP variants BSqi and BSrr outperform the other optimizers on the separable problems (FIDs 1 to 5). Similarly, the multi-level methods MLSL, fmincon and HMLSL are superior on the unimodal functions with high conditioning (FIDs 10 to 14), and the multimodal problems with adequate global structure (FIDs 15 to 19) are solved fastest by MCS (in 2D) and CMA-CSA (3D to 10D). The remaining problems, i.e., functions with low or moderate conditioning (FIDs 6 to 9), as well as multimodal functions with a weak global structure (FIDs 20 to 24), do not show such obvious patterns. Nevertheless, one can see that HCMA, HMLSL, OQNLP and sometimes also MLSL perform quite promisingly on these function groups.
The fact that HCMA is often inferior to the other 11 solvers clearly indicates that the data contains sufficient potential for improving over the SBS – e.g., with a reasonable algorithm selector. This finding is supported by the small number of points located exactly on the diagonal of the scatterplots in Figure 4, which depict the ERT pairs of the two baseline algorithms for all 24 BBOB problems and per problem dimension. Noticeably, though not very surprisingly, the VBS and SBS always have higher ERTs on the multimodal problems (FIDs 15 to 24) than on the unimodal problems (FIDs 1 to 14). Also, the shape of the cloud of observations indicates that the problems' complexity grows with their dimension: while the 2D observations are mainly clustered in the lower left corner (indicating rather easy problems), the cloud stretches along the entire diagonal for higher-dimensional problems. Furthermore, it is obvious that two problems, FIDs 1 (Sphere) and 5 (Linear Slope), are consistently solved very fast (independent of their dimensionality), whereas FID 24 (Lunacek bi-Rastrigin) is always the most difficult problem for the two baseline algorithms.
Looking at the performance of HCMA across all 96 problems, one problem stands out: the three-dimensional version of the Büche-Rastrigin function (FID 4). On this (usually rather simple) function, HCMA has a relERT of 1 765.9, compared to at most 51.8 on the non-3D versions of this problem and 249.1 as the worst relERT across all remaining 95 BBOB problems. We therefore had a closer look at the submitted data of HCMA on FID 4 across all dimensions; the respective data is summarized in Table II. Comparing the results of the 20 optimization runs with each other, three of them seem to be clear outliers. However, it remains unclear whether they were caused by the algorithm itself – it could, for instance, have been stuck in a local optimum without terminating – or by some technical issue on the platform.
Table II: Detailed results of HCMA on FID 4 (Büche-Rastrigin) across all dimensions and instances (IID).

Dim | IID | Gap to Optimum | # Function Evaluations | Success? | ERT of VBS (Solver) | ERT of SBS (relERT)
----|-----|----------------|------------------------|----------|---------------------|--------------------
  2 |  1  | 8.44e-03       |     215                | TRUE     | 98.8 (BSqi)         | 405.2 (4.1)
  2 |  2  | 6.91e-03       |     545                | TRUE     |                     |
  2 |  3  | 8.84e-05       |     440                | TRUE     |                     |
  2 |  4  | 5.44e-03       |     248                | TRUE     |                     |
  2 |  5  | 4.29e-03       |     578                | TRUE     |                     |
  3 |  1  | 4.48e-03       | 976 690                | TRUE     | 219.6 (BSrr)        | 387 784.5 (1 765.9)
  3 |  2  | 9.99e-01       | 570 925                | FALSE    |                     |
  3 |  3  | 9.62e-04       |     781                | TRUE     |                     |
  3 |  4  | 1.72e-03       |   1 373                | TRUE     |                     |
  3 |  5  | 8.24e-03       |   1 369                | TRUE     |                     |
  5 |  1  | 6.52e-03       |   2 048                | TRUE     | 486.2 (BSrr)        | 25 197.0 (51.8)
  5 |  2  | 8.09e-04       |   2 248                | TRUE     |                     |
  5 |  3  | 1.00e+00       |  91 882                | FALSE    |                     |
  5 |  4  | 2.16e-03       |   2 382                | TRUE     |                     |
  5 |  5  | 8.54e-03       |   2 228                | TRUE     |                     |
 10 |  1  | 3.75e-03       |   4 253                | TRUE     | 1 067.8 (BSrr)      | 4 286.6 (4.0)
 10 |  2  | 6.72e-03       |   4 495                | TRUE     |                     |
 10 |  3  | 2.58e-03       |   4 632                | TRUE     |                     |
 10 |  4  | 2.79e-03       |   4 032                | TRUE     |                     |
 10 |  5  | 5.43e-03       |   4 021                | TRUE     |                     |
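Both ERT columns of Table II can be recomputed directly from the per-instance data: the ERT is the total number of function evaluations spent across all runs, divided by the number of successful runs, and the relERT normalizes it by the ERT of the VBS. A minimal sketch (the helper name `ert` is ours, not from the paper's code):

```python
def ert(evals, successes):
    """ERT = sum of all function evaluations / number of successful runs."""
    n_success = sum(successes)
    if n_success == 0:
        return float("inf")
    return sum(evals) / n_success

# HCMA on the 3D Büche-Rastrigin function (FID 4), instances 1-5 (Table II):
# the single unsuccessful run inflates the ERT dramatically.
evals = [976_690, 570_925, 781, 1_373, 1_369]
successes = [True, False, True, True, True]

ert_hcma = ert(evals, successes)      # 1 551 138 / 4 = 387 784.5
rel_ert = ert_hcma / 219.6            # normalized by the ERT of the VBS (BSrr)
print(ert_hcma, round(rel_ert, 1))    # 387784.5 1765.9
```

This reproduces exactly the 387 784.5 (relERT 1 765.9) reported for HCMA in 3D.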
V-B Analyzing the Best Algorithm Selector
After gaining more insight into the underlying data, we used the performance and feature data for training a total of 70 algorithm selectors: 14 machine learning algorithms (see Section IV-D), each of them combined either with one of the four feature selection strategies described in Section IV-E or with no feature selection at all. The classification-based support vector machine, which employed a greedy forward-backward feature selection strategy (sffs), achieved the best performance. While the single best solver (HCMA) had a mean relERT of 30.37, this algorithm selector (denoted as Model 1) had a mean relERT of 16.67 (including feature costs) and 13.60 (excluding feature costs). Hence, the selector is able to pick an optimizer from the portfolio which, on average, solves the respective problem twice as fast.
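Conceptually, each classification-based selector maps a problem's ELA feature vector to one of the portfolio's solvers and is then judged by the mean relERT of its choices. The sketch below illustrates this evaluation loop on invented data; a simple 1-nearest-neighbour rule stands in for the support vector machine, and a plain holdout split replaces the study's actual resampling setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 96 problems x 8 ELA features, plus the relERT of each
# of 12 portfolio solvers on every problem (both invented for illustration).
X = rng.normal(size=(96, 8))
rel_ert = rng.lognormal(mean=1.0, sigma=1.5, size=(96, 12))

train, test = np.arange(48), np.arange(48, 96)

def select_solver(x, X_train, rel_train):
    """1-NN selector: reuse the best solver of the most similar training problem."""
    nearest = np.linalg.norm(X_train - x, axis=1).argmin()
    return rel_train[nearest].argmin()

picks = np.array([select_solver(x, X[train], rel_ert[train]) for x in X[test]])

# The selector is scored by the mean relERT of the solvers it actually picks,
# and compared against always running the single best solver (SBS).
as_score = rel_ert[test, picks].mean()
sbs = rel_ert[train].mean(axis=0).argmin()   # SBS determined on training data
sbs_score = rel_ert[test, sbs].mean()
print(as_score, sbs_score)
```

The same scoring logic applies to all 70 trained selectors; only the mapping from features to solver labels changes.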
Looking at the scatterplots shown in the top row of Figure 5, one can see that – with the exception of the 2D Rastrigin function (FID 3) – Model 1 predicts on all instances an optimizer that performs at least as well as HCMA. Admittedly, this comparison is not quite fair, as the ERTs shown within the respective row do not consider the costs for computing the landscape features. Hence, a more realistic comparison of the selector and the SBS is shown in the second row of Figure 5. The fact that Model 1 performs worse than HCMA on some of the problems is negligible, as the losses only occur on rather simple – and thus quickly solvable – problems.
The allocated budget for computing the landscape features was 50 × d, i.e., between 100 and 500 function evaluations depending on the problem dimension. The ERT of the portfolio’s VBS, on the other hand, lies between 4.4 and 7 371 411. These numbers (a) explain why Model 1 performs poorly on rather simple problems, and (b) support the thesis that the costs for computing the landscape features only have a small impact on the total costs when solving rather complex problems. One should also be aware that the number of function evaluations required for the initial design lies within the range of common initial population sizes, so that in practice no additional feature costs would occur.
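Under this accounting, the feature budget (between 100 and 500 evaluations, depending on dimension) is simply added to the selected solver's ERT before normalizing by the VBS – a simplification of the paper's bookkeeping, with illustrative numbers:

```python
# relERT of an algorithm selector with feature costs included: the evaluations
# spent on the initial sample for ELA features count against the selector.
def rel_ert_with_costs(ert_selected, feature_evals, ert_vbs):
    return (feature_evals + ert_selected) / ert_vbs

# Illustrative 2D case (not from the paper's tables): a VBS with ERT 98.8,
# a selected solver with ERT 150, and 100 evaluations of feature costs.
print(rel_ert_with_costs(150.0, 100, 98.8))   # ≈ 2.53, vs. ≈ 1.52 without costs
```

For a cheap problem like this one the fixed feature budget nearly doubles the relERT, while for a problem whose VBS already needs millions of evaluations the same 100 evaluations are negligible.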
These findings are also visible in the upper row of Figure 6, which compares the relERTs of HCMA and Model 1 (on a log scale) across all 96 BBOB problems (distinguished by FID and dimension). For better visibility of the performance differences, the area between the relERTs of the SBS and Model 1 is highlighted in grey. One major contribution of our selector obviously is that it successfully avoids using HCMA on the three-dimensional version of FID 4. Moreover, one can see that the SBS mainly dominates our selector on the separable BBOB problems (FIDs 1 to 5), whereas our selector achieves equal or better performance on the majority of the remaining problems.
Eight features resulted from the feature selection approach and were included in Model 1: three features from the y-distribution feature set (the skewness, kurtosis and number of peaks of a kernel-density estimation of the problem’s objective values [6]), one level-set feature (the ratio of mean misclassification errors when using an LDA and an MDA [6]), two information content features (the maximum information content and the settling sensitivity [17]), one cell mapping feature (the standard deviation of the distances between each cell’s center and worst observation [21]), and one of the basic features (the best fitness value within the sample).

V-C Improving the Currently Best Algorithm Selector
Although Model 1 already performs better than the SBS, we tried to improve it further by making use of results from previous works. More precisely, we start with the set of features employed by Model 1 and try to improve the model by adding the meta-model and nearest better clustering features. The rationale behind this approach is that (a) these feature sets – as shown in [22] – are very useful for characterizing a landscape’s global structure, and (b) the feature set of Model 1 was derived from a greedy approach.
Based on the combination of the three feature sets from above, additional feature selections were performed. Two of the new selectors – namely the ones constructed using the greedy sfbs approach (mean relERT: 14.27) and the genetic algorithm (GA, 14.24) – performed better than Model 1. The latter model is from here on denoted as “Model 2”.
As displayed in Table I, Model 2 not only has the best overall performance, but also performs best on the multimodal functions with a weak global structure (FIDs 20 to 24). This effect is also visible in the bottom row of Figure 5, where the problems of the respective BBOB group (colored in pink) are located either in the left half of the respective plots (for 2D and 3D) or on the diagonal itself (5D and 10D).
Also, Figures 5 and 6 reveal that Model 2 deviates from the SBS more often than Model 1 does. This finding is also supported by Table III, which summarizes how often each of the solvers was predicted and how often it actually was the best solver. Although HCMA is the best choice for only roughly 15% of the problems (14 of 96), Model 1 selects it in approximately 80% of the cases and Model 2 in 46%. Interestingly, the two models only predict three and four different solvers, respectively.
The different behavior of the two selectors is likely caused by the differences in the features they use. While three of the features are identical for both selectors – the previously described angle and level-set features, as well as the skewness of the objective values – the remaining six features of Model 2 are meta-model and nearest better clustering (NBC) features. The two meta-model features measure (1) the smallest absolute, non-intercept coefficient of a linear model (without interactions) and (2) the adjusted model fit (adjusted R²) of a quadratic model (without interactions) [6]. The remaining four NBC features measure (1) the ratio of the standard deviations of the distances among the nearest neighbours and the nearest better neighbours, (2) the ratio of the respective arithmetic means, (3) the correlation between the distances of the nearest neighbours and the nearest better neighbours, and (4) the so-called in-degree, i.e., the correlation between the fitness value of an observation and the number of observations to which it is the nearest better neighbour [22].
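The first three NBC features can be computed from an initial design alone. The sketch below is a simplified, hand-rolled variant (flacco's implementation additionally computes the in-degree correlation and handles ties and the sample's best point differently); the sphere function serves as a toy landscape:

```python
import numpy as np

def nbc_features(X, y):
    """Simplified NBC features: compare each point's distance to its nearest
    neighbour (nn) with its distance to its nearest *better* neighbour (nb),
    i.e., the closest point with a strictly better fitness value."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    nn = D.min(axis=1)
    nb = np.array([
        D[i, y < y[i]].min() if np.any(y < y[i]) else np.nan
        for i in range(len(y))
    ])
    ok = ~np.isnan(nb)             # the sample's best point has no better neighbour
    return {
        "nb_std_ratio": nn[ok].std() / nb[ok].std(),     # ratio of std. deviations
        "nb_mean_ratio": nn[ok].mean() / nb[ok].mean(),  # ratio of arithmetic means
        "nn_nb_cor": np.corrcoef(nn[ok], nb[ok])[0, 1],  # correlation of distances
    }

rng = np.random.default_rng(1)
X = rng.uniform(-5, 5, size=(200, 2))   # initial design on a 2D problem
y = (X ** 2).sum(axis=1)                # sphere function as a toy landscape
print(nbc_features(X, y))
```

On unimodal landscapes the nearest better neighbour is typically close to the nearest neighbour, whereas funnel-free multimodal landscapes pull the two distance distributions apart – which is what makes these ratios informative about global structure.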
Table III: How often each solver was the true best solver (#) and how often each solver was predicted by Model 1 / Model 2.

True Best Solver |  # | Predicted Solver (Model 1 / Model 2)
                 |    | fmincon | HCMA    | HMLSL  | MLSL
BSqi             |  6 |  3 / 2  |  3 / 2  | 0 / 2  | 0 / 0
BSrr             |  6 |  1 / 2  |  5 / 4  | 0 / 0  | 0 / 0
CMA-CSA          |  7 |  0 / 1  |  7 / 3  | 0 / 3  | 0 / 0
fmincon          | 12 |  0 / 4  |  8 / 4  | 0 / 1  | 4 / 3
fminunc          |  6 |  1 / 2  |  4 / 3  | 0 / 0  | 1 / 1
HCMA             | 14 |  0 / 3  | 14 / 11 | 0 / 0  | 0 / 0
HMLSL            | 11 |  3 / 3  |  7 / 4  | 0 / 0  | 1 / 4
IPOP400D         |  7 |  0 / 0  |  7 / 3  | 0 / 3  | 0 / 1
MCS              |  4 |  2 / 1  |  2 / 0  | 0 / 3  | 0 / 0
MLSL             | 12 |  4 / 2  |  8 / 6  | 0 / 2  | 0 / 2
OQNLP            |  6 |  0 / 1  |  6 / 2  | 0 / 2  | 0 / 1
SMAC-BBOB        |  5 |  0 / 0  |  5 / 2  | 0 / 1  | 0 / 2
Total            | 96 | 14 / 21 | 76 / 44 | 0 / 17 | 6 / 14
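The percentages quoted above can be verified directly from the per-row counts of Table III:

```python
import numpy as np

# Per-row prediction counts from Table III; columns: fmincon, HCMA, HMLSL, MLSL.
model1 = np.array([
    [3, 3, 0, 0], [1, 5, 0, 0], [0, 7, 0, 0], [0, 8, 0, 4],
    [1, 4, 0, 1], [0, 14, 0, 0], [3, 7, 0, 1], [0, 7, 0, 0],
    [2, 2, 0, 0], [4, 8, 0, 0], [0, 6, 0, 0], [0, 5, 0, 0],
])
model2 = np.array([
    [2, 2, 2, 0], [2, 4, 0, 0], [1, 3, 3, 0], [4, 4, 1, 3],
    [2, 3, 0, 1], [3, 11, 0, 0], [3, 4, 0, 4], [0, 3, 3, 1],
    [1, 0, 3, 0], [2, 6, 2, 2], [1, 2, 2, 1], [0, 2, 1, 2],
])

print(model1.sum(axis=0))          # column totals: 14, 76, 0, 6
print(model2.sum(axis=0))          # column totals: 21, 44, 17, 14
print(model1.sum(axis=0)[1] / 96)  # HCMA predicted in roughly 80% of the cases
print(model2.sum(axis=0)[1] / 96)  # ... versus roughly 46% for Model 2
```

Both column sums add up to 96 problems, and the HCMA shares (76/96 and 44/96) match the approximately 80% and 46% stated above.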
VI Conclusion
We showed that sophisticated machine learning techniques, combined with informative exploratory landscape features, form a powerful tool for automatically constructing algorithm selection models for unseen optimization problems. In our setting of continuous black-box optimization, we were able to halve the mean relative ERT of the portfolio’s single best solver, relying only on a rather small set of ELA features and function evaluations. A dedicated R-package, enhanced by a user-friendly graphical user interface, allows for transferring the presented methodology to other (continuous) optimization scenarios differing from the BBOB setting.
Of course, the quality and usability of such models heavily relies on the quality of the algorithm benchmark as well as on the representativeness of the included optimization problems. While we considered the well-known and commonly accepted BBOB workshop setting, we are aware of possible shortcomings and will extend our research to other comprehensive benchmarks – once available – as well as to practical applications.
Finally, we are working on extending the methodological approach to constrained and multi-objective optimization problems in order to increase its applicability to real-world scenarios.
Acknowledgment
The authors acknowledge support from the European Research Center for Information Systems (ERCIS, https://www.ercis.org/).
References
 [1] J. R. Rice, “The Algorithm Selection Problem,” Advances in Computers, vol. 15, pp. 65 – 118, 1975.
 [2] B. Bischl, O. Mersmann, H. Trautmann, and M. Preuss, “Algorithm Selection Based on Exploratory Landscape Analysis and Cost-Sensitive Learning,” in Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation (GECCO). ACM, 2012, pp. 313 – 320.
 [3] M. A. Muñoz Acosta, Y. Sun, M. Kirley, and S. K. Halgamuge, “Algorithm Selection for Black-Box Continuous Optimization Problems: A Survey on Methods and Challenges,” Information Sciences, vol. 317, pp. 224 – 245, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0020025515003680
 [4] L. Kotthoff, P. Kerschke, H. H. Hoos, and H. Trautmann, “Improving the State of the Art in Inexact TSP Solving Using Per-Instance Algorithm Selection,” in Proceedings of the 9th International Conference on Learning and Intelligent Optimization (LION), ser. Lecture Notes in Computer Science, C. Dhaenens, L. Jourdan, and M.-E. Marmion, Eds., vol. 8994. Springer, January 2015, pp. 202 – 217.
 [5] P. Kerschke, L. Kotthoff, J. Bossek, H. H. Hoos, and H. Trautmann, “Leveraging TSP Solver Complementarity through Machine Learning,” Evolutionary Computation Journal (accepted).
 [6] O. Mersmann, B. Bischl, H. Trautmann, M. Preuss, C. Weihs, and G. Rudolph, “Exploratory Landscape Analysis,” in Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation (GECCO). ACM, 2011, pp. 829 – 836.
 [7] O. Mersmann, B. Bischl, H. Trautmann, M. Wagner, J. Bossek, and F. Neumann, “A Novel Feature-Based Approach to Characterize Algorithm Performance for the Traveling Salesperson Problem,” Annals of Mathematics and Artificial Intelligence, vol. 69, pp. 151 – 182, October 2013.
 [8] F. Hutter, L. Xu, H. H. Hoos, and K. Leyton-Brown, “Algorithm Runtime Prediction: Methods & Evaluation,” Artificial Intelligence Journal, vol. 206, pp. 79 – 111, 2014.
 [9] G. Ochoa, S. Verel, F. Daolio, and M. Tomassini, “Local Optima Networks: A New Model of Combinatorial Fitness Landscapes,” in Recent Advances in the Theory and Application of Fitness Landscapes, ser. Emergence, Complexity and Computation. Springer, 2014, pp. 233 – 262.
 [10] J. Pihera and N. Musliu, “Application of Machine Learning to Algorithm Selection for TSP,” in Proceedings of the IEEE 26th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE press, 2014.
 [11] F. Daolio, A. Liefooghe, S. Verel, H. Aguirre, and K. Tanaka, “Problem features vs. algorithm performance on rugged multiobjective combinatorial fitness landscapes,” Evolutionary Computation, 2016.
 [12] T. Jones and S. Forrest, “Fitness Distance Correlation as a Measure of Problem Difficulty for Genetic Algorithms,” in Proceedings of the 6th International Conference on Genetic Algorithms (ICGA). Morgan Kaufmann Publishers Inc., 1995, pp. 184 – 192. [Online]. Available: http://dl.acm.org/citation.cfm?id=657929
 [13] M. Lunacek and L. D. Whitley, “The Dispersion Metric and the CMA Evolution Strategy,” in Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation (GECCO). ACM, 2006, pp. 477 – 484.
 [14] K. M. Malan and A. P. Engelbrecht, “Quantifying Ruggedness of Continuous Landscapes Using Entropy,” in Proceedings of the IEEE Congress on Evolutionary Computation (CEC). IEEE, 2009, pp. 1440 – 1447.
 [15] C. L. Müller and I. F. Sbalzarini, “Global Characterization of the CEC 2005 Fitness Landscapes Using FitnessDistance Analysis,” in Proceedings of the European Conference on the Applications of Evolutionary Computation (EvoApplications), ser. Lecture Notes in Computer Science. Springer, April 2011, pp. 294 – 303.
 [16] T. Abell, Y. Malitsky, and K. Tierney, “Features for Exploiting Black-Box Optimization Problem Structure,” in Proceedings of the 7th International Conference on Learning and Intelligent Optimization (LION), G. Nicosia and P. Pardalos, Eds. Springer, January 2013, pp. 30 – 36.
 [17] M. A. Muñoz Acosta, M. Kirley, and S. K. Halgamuge, “Exploratory Landscape Analysis of Continuous Space Optimization Problems Using Information Content,” IEEE Transactions on Evolutionary Computation, vol. 19, no. 1, pp. 74 – 87, 2015.
 [18] R. Morgan and M. Gallagher, “Analysing and Characterising Optimization Problems Using Length Scale,” Soft Computing, pp. 1 – 18, 2015.
 [19] K. M. Malan, J. F. Oberholzer, and A. P. Engelbrecht, “Characterising Constrained Continuous Optimisation Problems,” in Proceedings of the IEEE Congress on Evolutionary Computation (CEC). IEEE, 2015, pp. 1351–1358.
 [20] S. Shirakawa and T. Nagao, “Bag of local landscape features for fitness landscape analysis,” Soft Computing, vol. 20, no. 10, pp. 3787–3802, 2016.
 [21] P. Kerschke, M. Preuss, C. I. Hernández Castellanos, O. Schütze, J.-Q. Sun, C. Grimme, G. Rudolph, B. Bischl, and H. Trautmann, “Cell Mapping Techniques for Exploratory Landscape Analysis,” in EVOLVE – A Bridge between Probability, Set Oriented Numerics, and Evolutionary Computation V. Springer, 2014, pp. 115 – 131.
 [22] P. Kerschke, M. Preuss, S. Wessing, and H. Trautmann, “Detecting Funnel Structures by Means of Exploratory Landscape Analysis,” in Proceedings of the 17th Annual Conference on Genetic and Evolutionary Computation (GECCO). ACM, 2015, pp. 265 – 272.
 [23] ——, “LowBudget Exploratory Landscape Analysis on Multiple Peaks Models,” in Proceedings of the 18th Annual Conference on Genetic and Evolutionary Computation (GECCO). ACM, 2016.
 [24] G. VanRossum and The Python Development Team, The Python Language Reference – Release 3.5.0. Python Software Foundation, 2015.
 [25] MATLAB, Version 8.2.0 (R2013b). Natick, MA, USA: The MathWorks Inc., 2013.
 [26] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2017. [Online]. Available: http://www.R-project.org/
 [27] P. Kerschke, flacco: Feature-Based Landscape Analysis of Continuous and Constrained Optimization Problems, 2017, R-package version 1.6. [Online]. Available: https://cran.r-project.org/package=flacco
 [28] P. Kerschke and H. Trautmann, “The R-Package FLACCO for Exploratory Landscape Analysis with Applications to Multi-Objective Optimization Problems,” in Proceedings of the IEEE Congress on Evolutionary Computation (CEC). IEEE, 2016.
 [29] C. Hanster and P. Kerschke, “flaccogui: Exploratory Landscape Analysis for Everyone,” in Proceedings of the 19th Annual Conference on Genetic and Evolutionary Computation (GECCO) Companion. ACM, 2017.
 [30] J. Bossek, smoof: Single and Multi-Objective Optimization Test Functions, 2016, R-package version 1.4. [Online]. Available: https://github.com/jakobbossek/smoof
 [31] N. Hansen, A. Auger, O. Mersmann, T. Tušar, and D. Brockhoff, “COCO: A Platform for Comparing Continuous Optimizers in a Black-Box Setting,” CoRR, vol. abs/1603.08785, 2016. [Online]. Available: http://arxiv.org/abs/1603.08785
 [32] N. Hansen, A. Auger, S. Finck, and R. Ros, “Real-Parameter Black-Box Optimization Benchmarking 2009: Experimental Setup,” INRIA, Tech. Rep. RR-6828, 2009. [Online]. Available: https://hal.inria.fr/inria-00362649v3/document
 [33] S. Finck, N. Hansen, R. Ros, and A. Auger, “Real-Parameter Black-Box Optimization Benchmarking 2010: Presentation of the Noiseless Functions,” University of Applied Science Vorarlberg, Dornbirn, Austria, Tech. Rep., 2010. [Online]. Available: http://coco.lri.fr/downloads/download15.03/bbobdocfunctions.pdf
 [34] B. Beachkofski and R. Grandhi, “Improved Distributed Hypercube Sampling,” in Proceedings of the 43rd AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, AIAA paper 2002-1274. American Institute of Aeronautics and Astronautics, 2002.
 [35] S. Swarzberg, G. Seront, and H. Bersini, “S.T.E.P.: The Easiest Way to Optimize a Function,” in Proceedings of the First IEEE Congress on Evolutionary Computation (CEC). IEEE, June 1994, pp. 519 – 524.
 [36] R. P. Brent, Algorithms for Minimization Without Derivatives. Courier Corporation, 2013.
 [37] P. Baudiš and P. Pošík, “Global Line Search Algorithm Hybridized with Quadratic Interpolation and its Extension to Separable Functions,” in Proceedings of the 17th Annual Conference on Genetic and Evolutionary Computation (GECCO). New York, NY, USA: ACM, 2015, pp. 257 – 264.
 [38] P. Pošík and P. Baudiš, “Dimension Selection in Axis-Parallel Brent-STEP Method for Black-Box Optimization of Separable Continuous Functions,” in Proceedings of the 17th Annual Conference on Genetic and Evolutionary Computation (GECCO). New York, NY, USA: ACM, 2015, pp. 1151 – 1158. [Online]. Available: http://pasky.or.cz/sci/dimsel.pdf
 [39] L. Pál, “Comparison of Multistart Global Optimization Algorithms on the BBOB Noiseless Testbed,” in Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation (GECCO). ACM, July 2013, pp. 1153 – 1160. [Online]. Available: http://coco.gforge.inria.fr/lib/exe/fetch.php?media=pdf2013:w0303pal.pdf
 [40] A. H. G. Rinnooy Kan and G. T. Timmer, “Stochastic Global Optimization Methods Part II: Multi Level Methods,” Mathematical Programming, vol. 39, no. 1, pp. 57–78, 1987.
 [41] C. G. Broyden, “The Convergence of a Class of Double-Rank Minimization Algorithms 2. The New Algorithm,” Journal of Applied Mathematics, vol. 6, no. 3, pp. 222 – 231, 1970.
 [42] W. Huyer and A. Neumaier, “Benchmarking of MCS on the Noiseless Function Testbed,” in Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation (GECCO). ACM, July 2009. [Online]. Available: http://www.mat.univie.ac.at/%7Eneum/ms/mcs_exact.pdf
 [43] N. Hansen and A. Ostermeier, “Completely Derandomized Self-Adaptation in Evolution Strategies,” Evolutionary Computation Journal, vol. 9, no. 2, pp. 159 – 195, 2001.
 [44] A. Atamna, “Benchmarking IPOP-CMA-ES-TPA and IPOP-CMA-ES-MSR on the BBOB Noiseless Testbed,” in Proceedings of the 17th Annual Conference on Genetic and Evolutionary Computation (GECCO). New York, NY, USA: ACM, 2015, pp. 1135 – 1142. [Online]. Available: https://www.lri.fr/%7Eatamna/atamna_BBOB_15.pdf
 [45] S. van Rijn, H. Wang, M. van Leeuwen, and T. Bäck, “Evolving the Structure of Evolution Strategies,” in Proceedings of the IEEE Symposium Series on Computational Intelligence (SSCI). IEEE, December 2016.
 [46] A. Auger, D. Brockhoff, and N. Hansen, “Benchmarking the Local Metamodel CMA-ES on the Noiseless BBOB’2013 Test Bed,” in Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation (GECCO). ACM, July 2013, pp. 1225 – 1232. [Online]. Available: http://coco.gforge.inria.fr/lib/exe/fetch.php?media=pdf2013:w0313auger.pdf
 [47] A. Auger and N. Hansen, “A Restart CMA Evolution Strategy with Increasing Population Size,” in Proceedings of the IEEE Congress on Evolutionary Computation (CEC), vol. 2. IEEE, September 2005, pp. 1769 – 1776.
 [48] I. Loshchilov, M. Schoenauer, and M. Sebag, “Bi-Population CMA-ES Algorithms with Surrogate Models and Line Searches,” in Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation (GECCO). ACM, July 2013, pp. 1177 – 1184. [Online]. Available: http://coco.gforge.inria.fr/lib/exe/fetch.php?media=pdf2013:w0306loshchilov.pdf
 [49] N. Hansen, “Benchmarking a BI-Population CMA-ES on the BBOB-2009 Function Testbed,” in Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation (GECCO) Companion: Late Breaking Papers. ACM, July 2009, pp. 2389 – 2396.
 [50] I. Loshchilov, M. Schoenauer, and M. Sebag, “Intensive Surrogate Model Exploitation in Self-adaptive Surrogate-assisted CMA-ES (saACM-ES),” in Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation (GECCO). ACM, July 2013, pp. 439 – 446.
 [51] M. Powell, “The NEWUOA Software for Unconstrained Optimization Without Derivatives,” Large-scale Nonlinear Optimization, pp. 255 – 297, 2006.
 [52] F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Sequential Model-Based Optimization for General Algorithm Configuration,” in Proceedings of the 5th International Conference on Learning and Intelligent Optimization (LION), C. A. Coello Coello, Ed. Springer, January 2011, pp. 507 – 523.
 [53] F. Hutter, H. Hoos, and K. Leyton-Brown, “An Evaluation of Sequential Model-Based Optimization for Expensive Black-box Functions,” in Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation (GECCO). ACM, July 2013, pp. 1209 – 1216. [Online]. Available: http://coco.gforge.inria.fr/lib/exe/fetch.php?media=pdf2013:w0311hutter.pdf
 [54] C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning. MIT press Cambridge, 2006, vol. 1.
 [55] D. R. Jones, C. D. Perttunen, and B. E. Stuckman, “Lipschitzian Optimization without the Lipschitz Constant,” Journal of Optimization Theory and Applications, vol. 79, no. 1, pp. 157 – 181, 1993.
 [56] Z. Ugray, L. Lasdon, J. Plummer, F. Glover, J. Kelly, and R. Martí, “Scatter Search and Local NLP Solvers: A Multistart Framework for Global Optimization,” INFORMS Journal on Computing, vol. 19, no. 3, pp. 328 – 340, 2007.
 [57] M. Laguna and R. Martí, “The OptQuest Callable Library,” in Optimization Software Class Libraries. Springer, 2003, pp. 193 – 218.
 [58] B. Bischl, M. Lang, L. Kotthoff, J. Schiffner, J. Richter, Z. M. Jones, and G. Casalicchio, mlr: Machine Learning in R, 2016, R-package version 2.9. [Online]. Available: https://github.com/mlr-org/mlr
 [59] T. Therneau, B. Atkinson, and B. Ripley, rpart: Recursive Partitioning and Regression Trees, 2017, R-package version 4.1-11. [Online]. Available: https://CRAN.R-project.org/package=rpart
 [60] A. Karatzoglou, A. Smola, K. Hornik, and A. Zeileis, “kernlab – An S4 Package for Kernel Methods in R,” Journal of Statistical Software, vol. 11, no. 9, pp. 1 – 20, 2004. [Online]. Available: http://www.jstatsoft.org/v11/i09/
 [61] A. Liaw and M. Wiener, “Classification and Regression by randomForest,” R News, vol. 2, no. 3, pp. 18 – 22, 2002. [Online]. Available: http://CRAN.R-project.org/doc/Rnews/
 [62] T. Chen, T. He, M. Benesty, V. Khotilovich, and Y. Tang, xgboost: Extreme Gradient Boosting, 2017, R-package version 0.6-4. [Online]. Available: https://CRAN.R-project.org/package=xgboost
 [63] T. Hastie and R. Tibshirani (S original); R port by F. Leisch, K. Hornik, and B. D. Ripley, mda: Mixture and Flexible Discriminant Analysis, 2016, R-package version 0.4-9. [Online]. Available: https://CRAN.R-project.org/package=mda
 [64] B. Bischl, P. Kerschke, L. Kotthoff, T. M. Lindauer, Y. Malitsky, A. Fréchette, H. H. Hoos, F. Hutter, K. Leyton-Brown, K. Tierney, and J. Vanschoren, “ASlib: A Benchmark Library for Algorithm Selection,” Artificial Intelligence Journal, vol. 237, pp. 41 – 58, 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0004370216300388