# On the Impact of the Cutoff Time on the Performance of Algorithm Configurators

Algorithm configurators are automated methods to optimise the parameters of an algorithm for a class of problems. We evaluate the performance of a simple random local search configurator (ParamRLS) for tuning the neighbourhood size k of the RLS_k algorithm. We measure performance as the expected number of configuration evaluations required to identify the optimal value for the parameter. We analyse the impact of the cutoff time κ (the time spent evaluating a configuration for a problem instance) on the expected number of configuration evaluations required to find the optimal parameter value, where we compare configurations using either best found fitness values (ParamRLS-F) or optimisation times (ParamRLS-T). We consider tuning RLS_k for a variant of the Ridge function class (Ridge*), where the performance of each parameter value does not change during the run, and for the OneMax function class, where longer runs favour smaller k. We rigorously prove that ParamRLS-F efficiently tunes RLS_k for Ridge* for any κ while ParamRLS-T requires at least a quadratic one. For OneMax, ParamRLS-F identifies k=1 as optimal with linear κ while ParamRLS-T requires a κ of at least Ω(n log n). For smaller κ, ParamRLS-F identifies that k>1 performs better while ParamRLS-T returns k chosen uniformly at random.

## 1. Introduction

General purpose heuristics, such as evolutionary algorithms, have the advantage that they can generate high quality solutions to optimisation problems without requiring much knowledge about the problem at hand. All that is required to apply a general purpose heuristic is a suitable representation for candidate solutions and a measure (the fitness function) that allows us to compare the quality of different solutions against each other. However, it is well understood that different design choices and different settings of their numerous parameters (e.g., mutation rate, crossover rate, selective pressure and population size for generational genetic algorithms (GAs)) may considerably affect their performance and in turn the quality of the identified solutions. In particular, the capability of heuristics to identify high quality solutions in a short time depends crucially on the use of suitable parameter settings
(paper:EibenParameterControl, ). Traditionally the design and parameter tuning of the algorithm for the problem at hand has mainly been done manually. Typically, the developer chooses some algorithmic designs and values for the associated parameters and executes them on instances of the problem. Refinements are then made according to how well each algorithm/parameter configuration has performed. However, such a procedure (or a similar one) is a time-consuming and error-prone process. From a scientific research point of view, it is also biased by personal experience hence difficult to reproduce. Consequently it has become increasingly common to use automated and principled methodologies for algorithm development. In the literature, researchers have typically referred to the automated optimisation of algorithm performance as automated parameter tuning and automated algorithm configuration (chap:stutzle_lopez_ibanez, ). Recently more ambitious methodologies have emerged such as automated construction of heuristic algorithms (SATenstein, ; paper:Fukunaga2008, ) automated algorithm generation (paper:paramILS, ) and hyper-heuristics (BurkeEtAl2013, ). Although automating the algorithmic design has gained significant momentum in recent years, the idea has been around for over thirty years. In 1986 Grefenstette used a GA to optimise the parameters of another GA (paper:meta_GA_param_tuning, ). Since then several other heuristic methodologies have been employed to optimise algorithmic parameters including hill-climbing (paper:analysis_learning_plan_search_problem, ), beam search (paper:integrating_heuristics_constraint_satisfaction_probs, ), iterated local search (ParamILS) (paper:paramILS, ), gender-based GAs (paper:gender_based_GA_param_tuner, ) and more traditional GAs (EVOCA) (paper:new_algo_reduce_metaheuristic_effort, ). 
Recently more sophisticated methodologies have appeared based on racing (paper:racing_introduced, ) approaches for comparing several configurations in parallel and integrating statistical testing methods (paper:f_race_introduced, ). These include the popular irace configurator (paper:irace, ). Also surrogate models have been introduced to predict the computational cost of testing specific configurations in order to avoid poor choices. Popular examples of surrogate-based configurators are sequential parameter optimisation (SPOT) (paper:SPO, ; paper:SPOT, ) and the sequential model-based algorithm configuration (SMAC) (paper:ROAR_and_SMAC, ). While varying in several algorithmic details, all algorithm configurators generally aim to evolve better and better parameter values by evaluating the performance of candidate configurations on a training set of instances and using some perturbation mechanism (e.g., iterated local search in ParamILS or updating the sampling distributions in irace) to generate new ones based on the better performing ones in the previous generation. The overall aim is that the ultimately identified parameter values perform well (generalise) on unseen instances of the tackled problem. Many of the mentioned algorithm configurators have gained widespread usage since they have often identified better parameter values compared to carefully chosen default configurations (paper:irace, ; paper:paramILS, ; paper:SPO, ; paper:SPOT, ; paper:ROAR_and_SMAC, ). Despite their popularity, there is a lack of theoretical understanding of such configurators. For instance, it is unclear how good the identified parameters are compared to optimal ones for a given target algorithm and optimisation problem. In particular, if optimal parameter values may be identified by a given configurator, no indications are available regarding how large the total tuning budget should be for the task. 
Similarly, it is unclear how long each configuration should be run for (i.e., the cutoff time) when evaluating its performance on a training set instance. In this paper, we take a first step towards establishing a theoretical grounding of algorithm configurators. Similarly to the time complexity analysis of other fields (AugerDoerr, ), we begin by analysing simplified algorithms and problems with the aim of building up a set of mathematical techniques for future analyses of more sophisticated systems and of shedding light on the classes of problems for which more sophistication is required for good performance. We consider a simple hillclimbing tuner, which we call ParamRLS because it is a simplified version of the popular ParamILS tuner. The tuner mutates the value of one of its parameters chosen uniformly at random to create an offspring configuration which will be accepted if it performs at least as well as its parent on the training set. Regarding configuration performance evaluations, we consider two versions of ParamRLS. One, ParamRLS-T, compares the average runtimes required by the different configurations to identify the optimal solution of the target instances. If the instance is not solved by a configuration, then the cutoff time multiplied by a penalty factor, called the penalisation constant, is returned. This performance measure originates in the SAT community, where it is called penalised average runtime (PAR) (SATenstein, ). The other version, ParamRLS-F, compares the number of times that solutions of better fitness are identified within the cutoff time by the different configurations and breaks ties by preferring the configuration that took less time to identify them. We analyse time-based comparisons because they are typically used in ParamILS, and are also available in SMAC and irace. We compare them with the fitness-based strategy. 
While the tuner is very simple, the mathematical methods developed for its analysis are quite sophisticated and can be built upon for the analysis of more complicated algorithm configurators since the performance comparison of (at least) two parameter configurations is at the heart of virtually any parameter tuner. To the best of our knowledge, this is the first time that a rigorous time complexity analysis of algorithm configurators has been performed. The only related theoretical work regards the performance analysis of (online) parameter control of randomised search heuristics during the function optimisation phase (AlanaziLehre2014, ; DoerrEtAl2016B, ; DLOW2018, ; LehreOzcan2013, ; LissovoiEtAl2019, ; QianEtAl2016, ; LOWGecco2017, ; LOWArxiv2018, ).

We will analyse the number of iterations required by ParamRLS to identify optimal parameter values with overwhelming probability (w.o.p.) for the randomised local search (RLS_k) algorithm, where $k$, the only parameter, is the local search neighbourhood size (i.e., $k$ bits are flipped without replacement in each iteration). We say that a probability is overwhelming if it is at least $1 - 2^{-\Omega(n^{\varepsilon})}$ for some constant $\varepsilon > 0$; we frequently use that, by a union bound, any polynomial number of events that each occur w.o.p. also occur together w.o.p. Our aim is to characterise the impact of the cutoff time on the performance of the tuner. We will perform the analysis for two well-known black-box benchmark function classes: a modified version of Ridge (called Ridge*) and OneMax (DrosteJansenWegener2002, ). (The OneMax function class consists of $2^n$ functions over $\{0,1\}^n$, each with a different global optimum, and for each function the fitness decreases with the Hamming distance to the optimum.) Since for both function classes a given parameter configuration will have the same performance for all instances, these classes allow us to avoid the problems of deciding how many instances should be used in the training set (i.e., one instance suffices) and of evaluating the generalisation capabilities of the evolved parameters (i.e., the performance will be the same for all instances). Hence, we can concentrate on the impact of the cutoff time in isolation. The two function classes have different characteristics. For Ridge*, each parameter value has the same improvement probability independent of the position of the candidate solution in the search space. For OneMax, it is better to flip fewer bits the closer the candidate solution is to the optimum. This implies that for Ridge* the optimal parameter value is the same independent of how long the algorithm is run for, i.e., RLS_1 will have better performance even for very small cutoff times as long as a sufficient number of comparisons between different configurations are performed. For OneMax, short runs of RLS_k with larger values of $k$ find better solutions, whereas for longer runs smaller values of $k$ perform better. 
Our analysis shows that ParamRLS-F can efficiently identify that $k=1$ is the optimal parameter value for Ridge* independent of the cutoff time as long as the performance for each parameter configuration is evaluated a sufficient number of times. For OneMax, instead, ParamRLS-F identifies that $k=1$ is the optimal parameter for any cutoff time greater than . If the cutoff time is considerably smaller, then ParamRLS-F will identify that the optimal value is $k>1$. On the other hand, ParamRLS-T returns a parameter value chosen uniformly at random for any function containing up to an exponential number of optima if the cutoff time is smaller than . We show that for Ridge* the cutoff time for ParamRLS-T has to be at least quadratic in the problem size. This paper is split into three sections. In Section 2, we describe the algorithm configuration problem, the algorithms and the function classes considered in this paper. We analyse ParamRLS tuning RLS_k for Ridge* and OneMax in Sections 3 and 4, respectively. Some proofs are omitted from the main part of the paper due to space restrictions and can be found in the appendix.

## 2. Preliminaries

### 2.1. The Algorithm Configuration Problem

Informally, given an algorithm, its set of parameters and an optimisation problem, the algorithm configuration problem is that of identifying the set of parameter values for which the algorithm achieves best performance on the problem. We call the algorithm solving the configuration problem the configurator and the algorithm to be tuned the target algorithm (throughout the paper we use the terms configurator and tuner interchangeably). More formally, we use $\Theta$ to denote the parameter configuration space of the target algorithm (i.e., the search space of all feasible parameter configurations) and we denote a specific configuration by $\theta \in \Theta$. The performance of different configurations for the problem is evaluated on a training set of instances which should be representative of the problem. Finally, let $cost(\theta)$ be a measure of the performance of running the target algorithm with configuration $\theta$ over the training set. Then the algorithm configuration problem is that of finding

$$\theta^{*} \in \arg\min_{\theta \in \Theta} cost(\theta).$$

The $cost$ function estimates the performance of the target algorithm on a training set of problem instances. To do so the following decisions need to be made:

• Which instances (and how many) should be used in the training set;

• Cutoff time $\kappa$: the amount of time that the algorithm is run on each instance;

• Runs: the number of times the evaluation (of duration $\kappa$) should be repeated for each instance;

• Metric: the quantity that is measured to evaluate how well the algorithm performs on each instance;

• How to aggregate the measure of performance over all instances.

Since for the two instance classes considered in this paper (see Section 2.4) one random instance suffices for perfect generalisation (i.e., the algorithm configuration will work equally well on problem instances that are not in the training set), we do not need to worry about the choice of the training set nor how to aggregate performances over it. We will consider two different metrics:

1. The time required for the target algorithm to find the optimal solution of an instance. If the optimum is not found before the cutoff time $\kappa$, then $\kappa$ multiplied by a penalty constant is taken as the time to reach the optimum. This metric is commonly used in ParamILS (paper:paramILS, ).

2. The fitness of the best solution found within the cutoff time.

Let be the number of tested configurations before the optimal configuration is identified. We call this the number of evaluated configurations, or the number of evaluations. Then the total tuning time will be . Our aim in this paper is to estimate, for each metric, how the cutoff time and the number of runs impact the number of evaluated configurations and the total tuning time for a simple configurator called ParamRLS.
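The two metrics listed above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the `fitness_trace` representation (fitness after each iteration, starting with the initial solution) and the use of `None` for "optimum never found" are assumptions made here and do not come from the paper.

```python
def capped_opt_time(first_hit_time, cutoff, penalty):
    """Time-based metric for a single run.

    first_hit_time: iteration at which the optimum was first found,
                    or None if it was never found (hypothetical encoding).
    Returns the hitting time if it lies within the cutoff,
    otherwise the cutoff multiplied by the penalty constant.
    """
    if first_hit_time is not None and first_hit_time <= cutoff:
        return first_hit_time
    return penalty * cutoff


def best_fitness_within(fitness_trace, cutoff):
    """Fitness-based metric for a single run.

    fitness_trace[t] is the fitness of the current solution after t
    iterations (index 0 is the initial solution).
    """
    return max(fitness_trace[: cutoff + 1])
```

With a time-based comparison, two runs that both miss the optimum receive the same penalised value, so the comparison carries no information about which configuration made more progress.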

### 2.2. The Configurator: ParamRLS

We design our simple configurator following the framework laid out for ParamILS (paper:paramILS, ):

1. Initialise the configurator with some initial configuration;

2. mutate the current configuration by modifying a single parameter and accept the new configuration if it results in improved performance;

3. repeat Step 2 until no single parameter change yields an improvement.

Essentially we follow the above scheme where we initialise the configurator choosing a configuration uniformly at random from $\Theta$ and we change the acceptance criterion to accept a new configuration if it performs at least as well as its parent. Note that we occasionally refer to the current value of $\theta$ in Algorithm 1 as the active parameter. Concerning Step 2, ParamILS applies an Iterated Local Search procedure. We instead consider the following two simpler random local search operators and, thus, call the algorithm ParamRLS:

• $\pm\{1\}$: the chosen parameter value is increased or decreased by 1 uniformly at random;

• $\pm\{1,2\}$: the chosen parameter value is increased or decreased by 1 or by 2 uniformly at random.

The first operator has previously been analysed for the optimisation of functions defined over search spaces with larger alphabets than those that can be represented using bitstrings (paper:DoerrDoerrKoetzing16, ). The second one slightly enlarges the neighbourhood size. For both operators we use the interval-metric such that any mutation that oversteps a boundary is considered infeasible. The resulting configurator is described in Algorithm 1. The termination condition may be either a predetermined number of iterations without a change in configuration (i.e., the solution is likely a local or global optimum) or a fixed number of iterations. In this paper we calculate the number of iterations until the configurator identifies the optimal configuration and will not leave it with overwhelming probability, hence we also provide bounds on the termination criterion. If the configurator uses the fitness-based metric for performance evaluation described in the previous section, then we will call the algorithm ParamRLS-F, while if it uses the time-based metric, then we will refer to it as ParamRLS-T. The two evaluation procedures are described respectively in Algorithm 2 and in Algorithm 3. In Algorithm 3, we denote the capped optimisation time for the target algorithm on an instance with cutoff time $\kappa$ and a given penalty constant as CappedOptTime.
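The tuning loop described in this section can be sketched as follows for a single integer parameter. This is a minimal sketch and not a transcription of the paper's Algorithm 1: the names `param_rls`, `no_worse`, `phi` and `steps` are hypothetical, the comparison of two configurations is delegated to a user-supplied callback, and treating an infeasible mutation as a used-up iteration is an assumption.

```python
import random


def param_rls(no_worse, phi, steps, evaluations, rng=None):
    """Random local search over the parameter values 1..phi.

    no_worse(candidate, parent) -> bool: True if the candidate performed
        at least as well as the parent in an evaluation (hypothetical hook).
    steps: step sizes of the local search operator, e.g. (1,) for a
        plus/minus 1 operator or (1, 2) for a plus/minus 1-or-2 operator.
    evaluations: fixed number of iterations (one possible termination rule).
    """
    rng = rng or random.Random()
    theta = rng.randint(1, phi)  # uniformly random initial configuration
    for _ in range(evaluations):
        # Pick a step size and a direction uniformly at random.
        candidate = theta + rng.choice(steps) * rng.choice((-1, 1))
        if candidate < 1 or candidate > phi:
            continue  # mutation oversteps a boundary: infeasible
        if no_worse(candidate, theta):
            theta = candidate  # accept if at least as good as the parent
    return theta
```

For example, `param_rls(lambda c, p: c <= p, phi=10, steps=(1,), evaluations=1000)` models an idealised evaluation in which smaller values always win; the active parameter then drifts down to 1 and stays there.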

### 2.3. The Target Algorithm: RLS_k

In this paper we will evaluate the ParamRLS configurator for tuning the RLS_k algorithm, which has only one parameter $k$. RLS_k differs from conventional RLS in that the latter flips exactly one bit per iteration whereas RLS_k flips exactly $k$ bits per iteration, selected without replacement. Our aim is to identify the time required by our simple tuner to identify the best value for the parameter $k$. We provide the pseudocode for RLS_k in Algorithm 4. We define the permitted values for $k$ as the range $1, \dots, \phi$.
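A minimal Python sketch of the target algorithm follows. It is not the paper's Algorithm 4: the function name `rls_k`, the uniformly random default initialisation, and the acceptance rule (keep the offspring if its fitness is at least that of the parent) are assumptions made for illustration.

```python
import random


def rls_k(fitness, n, k, cutoff, x=None, rng=None):
    """Run RLS_k for `cutoff` iterations on bit strings of length n.

    In each iteration exactly k distinct positions are flipped, and the
    offspring replaces the parent if it is at least as fit (assumed rule).
    Returns the final bit string and its fitness.
    """
    rng = rng or random.Random()
    # Copy the start point if given, otherwise start uniformly at random.
    x = list(x) if x is not None else [rng.randint(0, 1) for _ in range(n)]
    fx = fitness(x)
    for _ in range(cutoff):
        y = x[:]
        for i in rng.sample(range(n), k):  # k positions without replacement
            y[i] ^= 1
        fy = fitness(y)
        if fy >= fx:
            x, fx = y, fy
    return x, fx
```

Passing `fitness=sum` turns this into RLS_k on the OneMax instance that counts 1-bits.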

### 2.4. The Function Classes Ridge* and OneMax

We will analyse the performance of ParamRLS for tuning RLS_k for two optimisation problems with considerably different characteristics: one where the performance of each parameter configuration does not change throughout the search space, and another where different configurations perform better depending on the cutoff time. For the first problem we consider a modified version of the standard Ridge benchmark problem (DrosteJansenWegener2002, ). The conventional Ridge function consists of a gradient of increasing fitness with the increase of the number of 0-bits in the bit string that leads towards the $0^n$ bit string (i.e., ZeroMax). From there a path of $n$ points, consisting of consecutive 1-bits followed only by 0-bits, may be found that leads to the global optimum (i.e., the $1^n$ bit string). To achieve the sought behaviour and at the same time simplify the analysis, we remove the ZeroMax part by assuming that the algorithm is initialised in the $0^n$ bit string. This technique was used by Jansen and Zarges in order to simplify their early fixed budget analyses (paper:fixed_budget_analysis, ). As a result any bit string not in the form $1^i 0^{n-i}$ will be rejected. We call our modified function Ridge*:

$$\text{Ridge*}(x) = \begin{cases} i, & \text{if } x \text{ is of the form } 1^i 0^{n-i}, \\ -1, & \text{otherwise.} \end{cases}$$

Since we are using RLS_k to optimise Ridge*, it will not always be possible to reach the optimum (i.e., $1^n$). The optimal value of Ridge* which we are able to reach when using RLS_k is in fact $k \lfloor n/k \rfloor$. In this work, we will consider reaching this value as having optimised the function. The black box optimisation version of Ridge* consists of $2^n$ functions. For each $a \in \{0,1\}^n$ the fitness of a solution $x$ for the corresponding function can be calculated using the following XOR transformation: $\text{Ridge*}_a(x) := \text{Ridge*}(x \oplus a)$ (DrosteJansenWegener2002b, ). For convenience of analysis we will use the Ridge* function displayed above where the path starts in the $0^n$ bit string and terminates in the $1^n$ bit string. The best parameter value for RLS_k for a random instance will naturally be optimal also for any other instance of the black box class. The second optimisation problem we will consider is the well-studied OneMax benchmark function. Its black box class consists of $2^n$ functions each of which has a different bit string as global optimum and the fitness of each other bit string decreases with the Hamming distance to the optimum. We tune the parameter $k$ for only one instance since the identified optimal parameter will naturally also be the best parameter for any of the other instances. In particular, we will use the instance: $\text{OneMax}(x) = \sum_{i=1}^{n} x_i$.
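The two benchmark functions of this section are short enough to write out directly. The sketch below assumes bit strings are Python lists of 0/1 integers; the function names and the helper `xor_shifted` (which builds another instance of a black box class via the XOR transformation) are hypothetical names chosen here.

```python
def ridge_star(x):
    """Return i if x consists of i ones followed only by zeroes, else -1."""
    i = 0
    while i < len(x) and x[i] == 1:
        i += 1
    # Any 1-bit after the leading block of ones takes x off the path.
    return -1 if any(x[i:]) else i


def onemax(x):
    """Number of 1-bits in x."""
    return sum(x)


def xor_shifted(f, a):
    """Instance of the black box class of f obtained by XOR-ing inputs with a."""
    return lambda x: f([xi ^ ai for xi, ai in zip(x, a)])
```

For instance, `xor_shifted(onemax, a)` has its global optimum at the bitwise complement of `a`, and its fitness decreases with the Hamming distance to that point.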

### 2.5. A General Result for ParamRLS-T

In this section we show that for ParamRLS-T the cutoff time has to be at least superlinear in the instance size or it will not work. We can show that, for any and any function with up to an exponential number of optima, ParamRLS-T with overwhelming probability will return a parameter value chosen uniformly at random, for any polynomial number of evaluations and runs per evaluation. In Section 3 we will show that has to be at least quadratic for ParamRLS-T to identify the optimal configuration of RLS for Ridge*.

###### Theorem 2.1 ().

For RLS on any function with up to optima, ParamRLS-T with cutoff time , local search operator or , and any polynomial number of evaluations and runs per evaluation , will return a value for chosen uniformly at random, with overwhelming probability.

###### Proof.

Note that RLS belongs to the class of unary unbiased black-box algorithms as defined in (Lehre2012, ). Then (paper:parallel_black_box_complexity_tail_bounds, , Theorem 20) (applied with ) tells us that all RLS algorithms require at least iterations to reach the optimum, with probability . By the union bound, the probability that none of the total runs of RLS reaches the optimum within iterations is at least , which is again overwhelming for any polynomial choices of and . This implies that the tuner has no information to guide the search process, and therefore accepts the new value of with probability 0.5. It is easy to show that the tuner returns a value for uniformly at random. ∎

## 3. ParamRLS for RLS_k and Ridge*

In this section we will prove that ParamRLS-F identifies the optimal parameter for RLS and for any cutoff time. If the cutoff time is large enough i.e., , then even just one run per configuration evaluation suffices. For smaller cutoff times, ParamRLS-F requires more runs per configuration evaluation to identify that RLS is better than any other RLS for . We will show this for the extreme case for which runs per evaluation suffice for ParamRLS-F to identify the correct parameter w.o.p. On the other hand, ParamRLS-T will return a random configuration for any . The range of parameter values goes up to ; larger values of degrade to random search.

### 3.1. Analysis of RLS_k on Ridge*

In this section we analyse how the performance of RLS_k for Ridge* changes with the parameter $k$.

###### Lemma 3.1 ().

For , the expected optimisation time of RLS_k on Ridge* is $\lfloor n/k \rfloor \binom{n}{k}$.

###### Proof.

During a single iteration, it is only possible to increase the fitness of an individual by exactly $k$ since we must flip exactly the first $k$ zeroes in the bit string (any other combination of flips will mean that the string is no longer in the form $1^i 0^{n-i}$ and will be rejected). We call an iteration in which we flip exactly the first $k$ zeroes in the bit string a leap. There are $\binom{n}{k}$ possible ways in which we can flip $k$ bits and exactly one of these combinations flips the first $k$ zeroes. Therefore the probability of making a leap at any time is $1/\binom{n}{k}$. By the waiting time argument, we wait $\binom{n}{k}$ iterations in expectation to make a single leap. Since we need to make $\lfloor n/k \rfloor$ leaps in order to reach the optimum, we wait $\lfloor n/k \rfloor \binom{n}{k}$ iterations in expectation until we reach the optimum. ∎
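The waiting-time argument in this proof can be turned into a small calculation. The sketch below assumes, as the counting argument suggests, that a leap succeeds with probability one over the number of ways to choose $k$ of $n$ bits and that $\lfloor n/k \rfloor$ leaps are needed; the function name is hypothetical.

```python
from math import comb


def expected_time_ridge_star(n, k):
    """Expected iterations of RLS_k on Ridge*: (number of leaps) x (wait per leap)."""
    leaps = n // k               # each leap adds exactly k ones
    wait_per_leap = comb(n, k)   # one of comb(n, k) flip sets is a leap
    return leaps * wait_per_leap


# Compare a few neighbourhood sizes for n = 100.
for k in range(1, 6):
    print(k, expected_time_ridge_star(100, k))
```

For $n = 100$ this gives $10^4$ iterations for $k=1$ and $247{,}500$ for $k=2$, and the values keep growing for $k = 3, 4, 5$, which illustrates why the smallest neighbourhood size is preferable on this function.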

###### Corollary 3.2 ().

A value of $k=1$ leads to the shortest expected optimisation time for RLS_k on Ridge* for any .

The optimisation time is also highly concentrated around the expectation, with deviations by (say) a factor of 2 having an exponentially small probability. The following lemma follows directly from Chernoff bounds.

###### Lemma 3.3 ().

With probability at least , RLS requires at least and at most iterations to optimise [.]

We can now consider the relative performance of RLS_a and RLS_b on Ridge*, for some $a < b$. We first derive a general bound which can be applied to any two random processes with probabilities of improving which stay the same throughout the process. We derive a lower bound on the probability that the process with the higher probability of improving is ahead at some time $t$. We apply this to RLS_a and RLS_b for Ridge*.

###### Lemma 3.4 ().

Let and be two random processes which both take values from the non-negative real numbers, and both start with value 0. At each time step, increases by some real number with probability , and otherwise stays put. At each time step, increases by some real number with probability , and otherwise stays put. Let and denote the total progress of and in steps, respectively. Let , , and . Then, for all and

$$\Pr\left(\Delta^b_t \ge \Delta^a_t\right) \le \exp\left(-qt\left(1 - 2 q_b^{\alpha/(\alpha+\beta)} q_a^{\beta/(\alpha+\beta)}\right)\right)$$

###### Proof.

Let be the probability that exactly one process makes progress in a single time step. Let be the conditional probability of making progress, given that one process makes progress, and define likewise. Assume that in steps we have progressing steps. Then the probability that makes at least as much progress as is . Then,

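$$\Pr\left(\Delta^b_t \ge \Delta^a_t\right) = \sum_{\ell=0}^{t} \Pr\left(\mathrm{Bin}(t,q) = \ell\right) \cdot \Pr\left(\mathrm{Bin}(\ell, q_b) \ge \lceil \ell\alpha/(\alpha+\beta) \rceil\right) \tag{1}$$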

Note that is equivalent to . Thus, . Hence

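$$\begin{aligned}
\Pr\left(\mathrm{Bin}(\ell, q_b) \ge \lceil \ell\alpha/(\alpha+\beta) \rceil\right)
&= \sum_{i=\lceil \ell\alpha/(\alpha+\beta) \rceil}^{\ell} \binom{\ell}{i} q_b^{i} q_a^{\ell-i} \\
&= \sum_{i=\lceil \ell\alpha/(\alpha+\beta) \rceil}^{\ell} \binom{\ell}{i} q_b^{\ell\alpha/(\alpha+\beta)} q_a^{\ell - \ell\alpha/(\alpha+\beta)} \left(q_b/q_a\right)^{i - \ell\alpha/(\alpha+\beta)} \\
&\le 2^{\ell} q_b^{\ell\alpha/(\alpha+\beta)} q_a^{\ell - \ell\alpha/(\alpha+\beta)} = \left(2 q_b^{\alpha/(\alpha+\beta)} q_a^{\beta/(\alpha+\beta)}\right)^{\ell}.
\end{aligned}$$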

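Using the above in (1) and $\Pr(\mathrm{Bin}(t,q) = \ell) = \binom{t}{\ell} q^{\ell} (1-q)^{t-\ell}$ yields

$$\begin{aligned}
\Pr\left(\Delta^b_t \ge \Delta^a_t\right)
&\le \sum_{\ell=0}^{t} \binom{t}{\ell} q^{\ell} (1-q)^{t-\ell} \cdot \left(2 q_b^{\alpha/(\alpha+\beta)} q_a^{\beta/(\alpha+\beta)}\right)^{\ell} \\
&= \sum_{\ell=0}^{t} \binom{t}{\ell} (1-q)^{t-\ell} \cdot \left(2q \cdot q_b^{\alpha/(\alpha+\beta)} q_a^{\beta/(\alpha+\beta)}\right)^{\ell} \\
&= \left(1 - q + 2q \cdot q_b^{\alpha/(\alpha+\beta)} q_a^{\beta/(\alpha+\beta)}\right)^{t} && \text{(using the Binomial Theorem)} \\
&= \left(1 - q\left(1 - 2 q_b^{\alpha/(\alpha+\beta)} q_a^{\beta/(\alpha+\beta)}\right)\right)^{t} \\
&\le \exp\left(-qt\left(1 - 2 q_b^{\alpha/(\alpha+\beta)} q_a^{\beta/(\alpha+\beta)}\right)\right).
\end{aligned}$$

∎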

Applying this lemma allows us to derive a lower bound on the probability that RLS_a wins against RLS_b ($a < b$) with a cutoff time of $\kappa$. Additional arguments for small $\kappa$ show that the probability that RLS_a wins is always at least $1/2$.

###### Lemma 3.5 ().

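For every , in an evaluation with a single run on Ridge* with cutoff time $\kappa$, RLS_a wins against RLS_b with probability at least

$$\max\left\{\frac{1}{2},\; 1 - \exp\left(-\frac{\kappa}{\binom{n}{a}} \cdot (1 - o(1))\right) - \exp\left(-\Omega(n/b)\right)\right\}$$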

### 3.2. ParamRLS-F Performance Analysis

Using the above lemmas, we now consider the cutoff time required before ParamRLS returns $k=1$ in expectation. The following theorem shows that one run per configuration evaluation suffices for large enough cutoff times. Note that it is not sufficient for the active parameter merely to be set to the value 1, since it is still possible for it to then change again to a different value. We therefore require that the active parameter remains at 1 for the remainder of the tuning time. We calculate this probability in the same theorem.

###### Theorem 3.6 ().

ParamRLS-F for RLS on Ridge* with , cutoff time , local search operator and any initial parameter value, in expectation after at most evaluations with a single run each has active parameter . If ParamRLS-F runs for evaluations, then it returns the parameter value with probability at least .

###### Proof.

By Lemma 3.5, the probability that RLS beats RLS in an evaluation with any cutoff time is at least . We can therefore model the tuning process as the value of the active parameter performing a lazy random walk over the states . We pessimistically assume that the active parameter decreases and increases by with respective probabilities and that it stays the same with probability . Using standard random walk arguments (Feller1968, ; Feller1971, ), the expected first hitting time of state 1 is at most . By Markov’s inequality, the probability that state 1 has not been reached in steps is at most . Hence the probability that state 1 is not reached during periods each consisting of steps is . Once state 1 is reached, we remain there unless RLS beats RLS in a run. By Lemma 3.5, this event happens in a specific evaluation with probability at most . By a union bound over at most evaluations, the probability that this ever happens is at most . ∎

We now show that even for extremely small cutoff times i.e., , the algorithm can identify the correct configuration as long as a sufficient number of runs are executed per configuration evaluation.

###### Theorem 3.7 ().

Consider ParamRLS-F for RLS on Ridge* with evaluations, each consisting of runs with cutoff time . Assume we are using the local search operator . In expectation the tuner requires at most evaluations in order to set the active parameter to . If the tuner is run for evaluations then it returns the value with probability at least

$$1 - 2^{-\Omega(T/\phi^2)} - T \cdot \left(2^{-\Omega(\kappa/n)} + 2^{-\Omega(n)}\right).$$

###### Proof.

Define as the number of runs out of runs, each with cutoff time , in which RLS makes progress. Define as the corresponding variable for RLS. Let . By Chernoff bounds, we can show that . We can also show that, again by Chernoff bounds, . Therefore, with overwhelming probability, RLS has made progress in more of these runs than RLS. That is, with overwhelming probability, RLS wins the evaluation. It is easy to show that, for , RLS beats RLS with probability at least . This means that we can make the same pessimistic assumption about the progress of the value of the active parameter as we do in the proof of Theorem 3.6. The remainder of the proof is identical. ∎

### 3.3. ParamRLS-T Performance Analysis

We conclude the section by showing that, unless the cutoff time is large, ParamRLS-T returns a value of $k$ chosen uniformly at random for RLS_k and Ridge*.

###### Theorem 3.8 ().

Consider ParamRLS-T for RLS on Ridge* with , local search operator or , cutoff time , and evaluations consisting each of runs. With overwhelming probability, for any polynomial choices of and , the tuner will return a value for chosen uniformly at random.

###### Proof.

For all , we have . By Lemma 3.3 with probability at least , no RLS with will have reached the optimum of Ridge* within iterations. Thus, with probability at least , no configuration reached the optimum of Ridge* in any of the runs in any of the evaluations. In this case, we can simply use the random walk argument as used in the proof of Theorem 3.6, but in this case the value of the active parameter will not settle on , meaning that ParamRLS-T will return a value for chosen uniformly at random. ∎

## 4. ParamRLS for RLS_k and OneMax

In this section we analyse the performance of ParamRLS when configuring RLS_k for OneMax. If RLS_k is only allowed to run for few fitness function evaluations, then the algorithm with larger parameter values for $k$ performs better than with smaller ones. On the other hand, if more fitness evaluations are allowed, then RLS_1 will be the fastest at identifying the optimum (DoerrYangArxiv, ). Our aim is to show that ParamRLS-F can identify whether $k=1$ is the optimal parameter choice or whether a larger value for $k$ performs better according to whether the cutoff time is small or large. Hence, to prove our point it suffices to consider the configurator with the following parameter vector:

which also simplifies the analysis. We will prove that ParamRLS-F identifies that is optimal for any even for single runs per configuration evaluation. This time is shorter than the expected time required by any configuration to optimise OneMax (i.e., ) (Lehre2012, ). If, instead, the cutoff time is smaller than , then ParamRLS-F will identify that is a better choice, as desired. The following lemma gives bounds on the expected progress towards the optimum in one step.

###### Lemma 4.1.

The expected progress of RLS_k with current distance $s$ to the optimum is

$$\Delta_k(s) = \sum_{i=\lfloor k/2 \rfloor + 1}^{k} (2i-k) \cdot \binom{s}{i}\binom{n-s}{k-i} \Big/ \binom{n}{k}.$$

In particular, for ,

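$$\begin{aligned}
\Delta_1(s) &= \frac{s}{n}, \\
\Delta_2(s) &= \frac{2s(s-1)}{n(n-1)} \le 2\left(\frac{s}{n}\right)^2, \\
\Delta_3(s) &= \frac{3s(s-1)}{n(n-1)} \le 3\left(\frac{s}{n}\right)^2, \\
\Delta_4(s) &= \frac{8s(s-1)(s-2)(n-s/2-3/2)}{n(n-1)(n-2)(n-3)} \le 8\left(\frac{s}{n}\right)^3, \\
\Delta_5(s) &= \frac{10s(s-1)(s-2)(n-s/2-3/2)}{n(n-1)(n-2)(n-3)} \le 10\left(\frac{s}{n}\right)^3.
\end{aligned}$$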
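The closed forms in Lemma 4.1 can be checked mechanically against the general sum. The following is a minimal sketch using exact rational arithmetic; the function names `drift` and `closed_form` are hypothetical helpers introduced here for illustration and are not part of the original analysis.

```python
from fractions import Fraction
from math import comb


def drift(k: int, s: int, n: int) -> Fraction:
    """Expected one-step progress of RLS_k at distance s, via the sum in Lemma 4.1."""
    total = Fraction(0)
    for i in range(k // 2 + 1, k + 1):
        # i of the k flipped bits were incorrect, so the distance shrinks by 2i - k.
        total += (2 * i - k) * comb(s, i) * comb(n - s, k - i)
    return total / comb(n, k)


def closed_form(k: int, s: int, n: int) -> Fraction:
    """Closed forms for k = 1..5 as stated in Lemma 4.1."""
    if k == 1:
        return Fraction(s, n)
    if k in (2, 3):
        return Fraction(k * s * (s - 1), n * (n - 1))
    if k in (4, 5):
        # 8 (n - s/2 - 3/2) = 4 (2n - s - 3); likewise 10 (...) = 5 (2n - s - 3).
        return Fraction(k * s * (s - 1) * (s - 2) * (2 * n - s - 3),
                        n * (n - 1) * (n - 2) * (n - 3))
    raise ValueError("closed forms are only stated for k = 1..5")


def upper_bound(k: int, s: int, n: int) -> Fraction:
    """Simplified upper bounds stated next to the closed forms."""
    x = Fraction(s, n)
    return {1: x, 2: 2 * x**2, 3: 3 * x**2, 4: 8 * x**3, 5: 10 * x**3}[k]
```

For instance, at distance s = n/2 the drift of RLS_3 exceeds that of RLS_1, while close to the optimum the order is reversed, which is the effect the remainder of this section exploits.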

It is well known that RLS_1 has the lowest expected optimisation time on OneMax among all RLS_k. It runs in expected time , which is best possible for all unary unbiased black-box algorithms (Doerr:2016:OPC:2908812.2908950, ; DoerrYangArxiv, ) up to terms of . It is also known (Doerr:2016:OPC:2908812.2908950, ; DoerrYangArxiv, ) that, regardless of the fitness of the individual, flipping bits never gives higher expected drift than flipping bits (for any positive integer ). For this reason, we use the local search operator .

### 4.1. k=1 is Optimal for Large Cutoff Times

For large cutoff times, ParamRLS-F is able to identify the optimal parameter value k=1. The analysis is surprisingly challenging, as most existing methods in the runtime analysis of evolutionary algorithms are geared towards first hitting times. Results on the expected fitness after a given cutoff time (fixed-budget results) are rare (paper:fixed_budget_analysis, ; paper:fixed_budget_linear_funcs, ; Doerr2013c, ; Jansen2014, ; Nallaperuma2017, ) and do not cover RLS_k for k>1. The following lemma establishes intervals such that the current distance to the optimum is contained in these intervals with overwhelming probability.

###### Lemma 4.2.

Consider RLS on OneMax with a cutoff time . Divide the first generations into 80 periods of length each. Define and and, for all ,

$$\ell_i = \ell_{i-1} - \frac{n}{20}\,\Delta_k(\ell_{i-1}) - o(n) \quad\text{and}\quad u_i = u_{i-1} - \frac{n}{20}\,\Delta_k(\ell_i) + o(n).$$

Then, with overwhelming probability at the end of period  for , the current distance to the optimum is in the interval and throughout period , , it is in the interval .

###### Proof.

We prove the statement by induction. At time 0, the current distance to the optimum is in with overwhelming probability by Chernoff bounds. Now assume that at the end of period , the current distance is in . In order to determine the next lower bound on the distance, we temporarily assume that at the end of period , we are precisely at distance . This assumption is pessimistic here since starting period  closer to the optimum can only decrease the distance to the optimum at the end of period . During period , since the current distance can only decrease and the expected progress is non-decreasing in the distance, the expected progress in each step is at most . By the method of bounded martingale differences (paper:scheideler_hab_thesis, , Theorem 3.67), the total progress in steps is thus at most with probability

$$1 - \exp\left(-\frac{\left((n/20)^{3/4}\right)^2}{2k^2\, n/20}\right) = 1 - \exp\left(-\Omega(n^{1/2})\right).$$

Hence we obtain as a lower bound on the distance at the end of period , with overwhelming probability. While the distance in period  is at least , the expected progress in every step is at least . Again using the method of bounded differences, by the same calculations as above, the progress is at least with overwhelming probability. This establishes as an upper bound on the distance at the end of period . Taking the union bound over all failure probabilities proves the claim. ∎
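To get a feeling for how the intervals of Lemma 4.2 evolve, the recurrences can be iterated numerically. The sketch below makes several simplifying assumptions that are not part of the lemma: distances are normalised by n, the o(n) terms are dropped, the large-n limits of the drifts from Lemma 4.1 are used, and both bounds start at relative distance 1/2. The function names are hypothetical.

```python
def limit_drift(k: int, x: float) -> float:
    """Large-n limit of Delta_k(s) from Lemma 4.1 at relative distance x = s/n."""
    if k == 1:
        return x
    if k in (2, 3):
        return k * x**2
    if k == 4:
        return 8 * x**3 * (1 - x / 2)
    if k == 5:
        return 10 * x**3 * (1 - x / 2)
    raise ValueError("drifts are only stated for k = 1..5")


def iterate_bounds(k: int, periods: int = 80, period_length: float = 1 / 20,
                   start: float = 0.5) -> tuple[float, float]:
    """Iterate the lower/upper recurrences of Lemma 4.1 without the o(n) terms.

    Returns the normalised bounds (l/n, u/n) after the given number of periods.
    """
    lower = upper = start
    for _ in range(periods):
        new_lower = lower - period_length * limit_drift(k, lower)
        # The upper bound uses the drift at the *new* lower bound, as in the lemma.
        upper = upper - period_length * limit_drift(k, new_lower)
        lower = new_lower
    return lower, upper


if __name__ == "__main__":
    for k in range(1, 6):
        lo, hi = iterate_bounds(k)
        print(f"k={k}: distance/n in [{lo:.4f}, {hi:.4f}] after 80 periods")
```

Under these assumptions the interval for RLS_1 after 80 periods lies entirely below the lower bounds for RLS_2 and RLS_3, in line with the linear gaps claimed next; the exact constants of the paper additionally account for the o(n) terms.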

Iterating the recurrent formulas from Lemma 4.2 shows the following.

###### Lemma 4.3.

After steps, RLS is ahead of RLS and RLS by a linear distance: and respectively. Furthermore, RLS is ahead of RLS and RLS by a linear distance: and respectively. And the distance to the optimum is at most for RLS, RLS and RLS.

We conclude that for every , smaller parameters win with overwhelming probability.

###### Theorem 4.4.

For every cutoff time , with overwhelming probability RLS beats RLS as well as RLS and RLS beats RLS as well as RLS.

###### Proof.

Lemma 4.3 proves the claim for a cutoff time of . For larger cutoff times, it is possible for the algorithms that lag behind to catch up. To this end, we define the distance between two algorithms RLS, RLS with as , where and refer to the respective distances to the optimum at time . Initially we have for all considered algorithm pairs. We then show that, as long as , the distance has a tendency to increase. We then apply the negative drift theorem (Oliveto2011, ; Oliveto2012Erratum, ) in the version for self-loops (Rowe2013, ) to show that with overwhelming probability does not drop to 0 until RLS has found an optimum (). Details are omitted due to space restrictions. ∎

We are now able to derive the expected number of evaluations required for the tuner to return for RLS on OneMax with a large enough cutoff time (for these results to hold, we assume that we use a local search operator of ).

###### Theorem 4.5.

For ParamRLS-F tuning RLS for OneMax, with cutoff time , , local search operator , evaluations and runs per evaluation, with and both polynomial, then in expectation we require at most 8 evaluations before the active parameter is set to for the first time. If for some constant then the tuner returns the parameter

###### Proof.

We use a similar technique to that used in the proof of Theorem 3.6. In this case, however, we split the state space of the value of the active parameter into just three states: , , and . We know from Theorem 4.4 that RLS beats RLS and RLS with overwhelming probability in a run with cutoff time . Let us assume that this always happens. Then the transition probability from state to state is at least , since this is the probability that we evaluate RLS against RLS or RLS against RLS. In all other cases, depending on whether RLS beats RLS, we either move to state or stay in state . By a similar argument, the transition probability from state to state is at least , and with probability at most we remain in state . Therefore, in the worst case (where the initial choice for the parameter puts us in state ), we require, in expectation, at most 8 evaluations before we hit state . A Chernoff bound for geometric random variables (chapter:doerr_tools_from_prob_theory, , Theorem 1.14) tells us that the probability that we require more than evaluations to hit state when starting from state is at most . If for some constant then evaluations are sufficient. Recall that we still need the probability that we remain in state after hitting it for the first time. In the worst case, this means that we require that RLS beats RLS or RLS for all runs within the tuning process. Recall that RLS beats RLS and RLS beats RLS. By Theorem 4.4 and the definition of overwhelming probabilities, the probability that we remain in state after hitting it for the first time is therefore at least for some constant . ∎
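The counting argument in this proof amounts to a sum of waiting times. The sketch below illustrates it for a pessimistic chain that has to make two forward transitions and never moves backwards. The per-evaluation success probability of 1/4 is an assumption made here for illustration (it is consistent with the stated bound of 8 evaluations, but the value is not taken from the text), and both function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def expected_evaluations(step_probabilities) -> Fraction:
    """Expected evaluations to make every forward transition once, in order.

    Each stage is a geometric waiting time with the given success probability.
    """
    return sum((Fraction(1) / Fraction(p) for p in step_probabilities), Fraction(0))


def tail_probability(t: int, p: Fraction, stages: int = 2) -> Fraction:
    """Probability that more than t evaluations are needed when every stage succeeds w.p. p.

    More than t evaluations are needed exactly when fewer than `stages`
    successes occur among the first t evaluations.
    """
    return sum((comb(t, j) * p**j * (1 - p) ** (t - j) for j in range(stages)),
               Fraction(0))


if __name__ == "__main__":
    p = Fraction(1, 4)  # assumed lower bound on each transition probability
    print(expected_evaluations([p, p]))
    print(float(tail_probability(100, p)))
```

The exact tail computed here decays exponentially in the number of evaluations, which is the qualitative behaviour the Chernoff bound for geometric random variables provides.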

### 4.2. k>1 is Optimal for Small Cutoff Times

We now show that if the cutoff time is small, then ParamRLS-F identifies that k=1 is no longer optimal, as desired.

###### Lemma 4.6.

For cutoff time the probability that RLS beats RLS is at most . The same holds for the probability that RLS beats RLS.

Footnote 5: Note that the result is only meaningful for as otherwise we get a trivial probability bound of

###### Proof.

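Let $s_{t,1}$ be the distance to the optimum in RLS_1 and $s_{t,3}$ be the distance to the optimum in RLS_3 at time $t$. Let $\varepsilon$ be a constant chosen later, then by Chernoff bounds,

$$\Pr\left(s_{0,1}, s_{0,3} \in \left[\frac{n-\varepsilon\kappa}{2}, \frac{n+\varepsilon\kappa}{2}\right]\right) \ge 1 - 4e^{-\Omega(\kappa^2/n)}.$$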

We assume in the following that this is the case. Then RLS wins if in steps RLS’s progress exceeds that of RLS by at least . Define to be the difference in the progress values made by the two algorithms. Along with the drift bounds from Lemma 4.1,

$$\mathrm{E}(D_t) = \frac{s_{t,3}}{n} \cdot \frac{3(s_{t,3}-1)}{n-1} - \frac{s_{t,1}}{n} = 3\left(\frac{s_{t,3}}{n}\right)^2 - \frac{s_{t,1}}{n} - O(1/n).$$

Note that the leading constant in is chosen as . This implies that for we always have and . We bound the latter using and if we choose small enough, we have . Using these inequalities,

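$$\begin{aligned}
\mathrm{E}(D_t) &\ge 3\left(\frac{1}{\sqrt{6}}+\varepsilon\right)^2 - \left(\frac{1}{2}+\varepsilon\right) - O(1/n) \\
&= \frac{1}{2} + \sqrt{6}\,\varepsilon + 3\varepsilon^2 - \frac{1}{2} - \varepsilon - O(1/n) \ge (\sqrt{6}-1)\,\varepsilon - O(1/n).
\end{aligned}$$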

Now, for , using we derive . By the method of bounded differences (paper:scheideler_hab_thesis, , Theorem 3.67), this is at most . ∎
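The algebra behind the lower bound on the expected difference in progress can be sanity-checked numerically. The sketch below evaluates the expression $3(1/\sqrt{6}+\varepsilon)^2 - (1/2+\varepsilon)$ from the proof and compares it with the linear bound $(\sqrt{6}-1)\varepsilon$. The O(1/n) term is ignored here, which is a simplification, and the function names are hypothetical.

```python
from math import sqrt


def drift_gap(eps: float) -> float:
    """Value of 3 (1/sqrt(6) + eps)^2 - (1/2 + eps), ignoring the O(1/n) term."""
    return 3 * (1 / sqrt(6) + eps) ** 2 - (0.5 + eps)


def linear_bound(eps: float) -> float:
    """The simplified lower bound (sqrt(6) - 1) * eps used in the proof."""
    return (sqrt(6) - 1) * eps


if __name__ == "__main__":
    for eps in (0.0, 0.01, 0.05, 0.1):
        # The two quantities differ by exactly the dropped term 3 * eps^2.
        print(eps, drift_gap(eps), linear_bound(eps))
```

The gap between the two quantities is the dropped term $3\varepsilon^2$, so the linear bound is tight for small $\varepsilon$.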

###### Theorem 4.7.

When tuning RLS for OneMax, the probability that ParamRLS-F with cutoff time , local search operator or and returns the value , for any number of evaluations , is at most .

###### Proof.

In order for ParamRLS-F to return a value of , it is necessary for RLS to beat either RLS or RLS in at least one evaluation. In the best case scenario, each evaluation in the tuning process will be either RLS or RLS against RLS, since this maximises the number of opportunities in which RLS has to win one of these evaluations. Using the upper bounds on the probabilities of RLS beating RLS and RLS (see Lemma 4.6), the union bound tells us that the probability that RLS wins any one of these evaluations is at most . ∎
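The contrast between small and large cutoff times can also be explored empirically. The following is a minimal simulation sketch, not the experimental setup of the paper. It assumes that RLS_k flips k distinct bit positions chosen uniformly at random and accepts the offspring if its fitness is not worse, that the initial search point is uniformly random, and that an evaluation is won by the strictly higher final fitness. The names `rls_k_fitness` and `win_rate` are hypothetical.

```python
import random


def rls_k_fitness(n: int, k: int, cutoff: int, rng: random.Random) -> int:
    """Run RLS_k on OneMax for at most `cutoff` iterations and return the final fitness.

    Because only non-worsening moves are accepted, the final fitness is also
    the best fitness found during the run.
    """
    bits = [rng.randint(0, 1) for _ in range(n)]
    fitness = sum(bits)
    for _ in range(cutoff):
        if fitness == n:
            break  # optimum reached, no further improvement possible
        positions = rng.sample(range(n), k)
        # Flipping a 0 gains one unit of fitness, flipping a 1 loses one.
        delta = sum(1 - 2 * bits[p] for p in positions)
        if delta >= 0:
            for p in positions:
                bits[p] ^= 1
            fitness += delta
    return fitness


def win_rate(n: int, k_a: int, k_b: int, cutoff: int, trials: int,
             seed: int = 0) -> float:
    """Fraction of paired runs in which RLS_{k_a} ends strictly fitter than RLS_{k_b}."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        if rls_k_fitness(n, k_a, cutoff, rng) > rls_k_fitness(n, k_b, cutoff, rng):
            wins += 1
    return wins / trials


if __name__ == "__main__":
    n = 200
    for cutoff in (n // 10, n // 2, 4 * n):
        print(cutoff, win_rate(n, 1, 3, cutoff, trials=200))
```

Varying `cutoff` relative to `n` in such a script gives an informal view of how often RLS_1 wins an evaluation against RLS_3; the rigorous statements are those of Theorems 4.4 and 4.7.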

## 5. Conclusions

We have shown that the cutoff time only slightly impacts the performance of ParamRLS-F. ParamRLS-F can identify that k=1 is the optimal parameter value for both optimisation problems for large enough cutoff times. Surprisingly, for such cutoff times, a single run per configuration evaluation is sufficient to achieve the desired results. While we do not expect this to be the case for harder optimisation problems, it is promising that for the simple unimodal problems considered herein multiple configuration evaluations are not necessary. Furthermore, the required cutoff times of and , respectively for Ridge* and OneMax, are considerably smaller than the expected time for any parameter configuration to optimise either problem (i.e., and respectively for the best configuration (k=1)). On the other hand, if the cutoff times are small, ParamRLS-F identifies that for Ridge* the optimal parameter value is still k=1, as long as sufficient runs are performed to evaluate the performance of parameter configurations. We prove this effect for the extreme value for which runs suffice to identify the better configuration with overwhelming probability. Note that runs lasting one generation each are still considerably smaller than the time required for any configuration to identify the optimum of Ridge*. Concerning OneMax, instead, for cutoff times smaller than we proved that ParamRLS-F identifies that k=1 is not the best parameter, as desired (i.e., RLS will produce better solutions than RLS if the time budget is small).

The impact of the cutoff time on ParamRLS-T, instead, is substantial. The configurator cannot optimise the single parameter of RLS_k applied to any function, even functions with up to exponentially many optima, if the cutoff time is smaller than independent of the number of runs per configuration evaluation. For small cutoff times, even if the tuner happens to set the active parameter to the optimal value, it will not be identified as optimal, making it unlikely that it stays there for the remainder of the tuning process. For the unimodal Ridge* function at least a quadratic cutoff time is required.

#### Acknowledgements

This work was supported by the EPSRC under grant EP/M004252/1.

## References

• [1] Fawaz Alanazi and Per Kristian Lehre. Runtime analysis of selection hyper-heuristics with classical learning mechanisms. In 2014 IEEE congress on evolutionary computation (CEC), pages 2515–2523. IEEE, 2014.
• [2] Carlos Ansótegui, Meinolf Sellmann, and Kevin Tierney. A gender-based genetic algorithm for the automatic configuration of algorithms. In International Conference on Principles and Practice of Constraint Programming, pages 142–157. Springer, 2009.
• [3] Anne Auger and Benjamin Doerr, editors. Theory of Randomized Search Heuristics. World Scientific, 2011.
• [4] Golnaz Badkobeh, Per Kristian Lehre, and Dirk Sudholt. Black-box complexity of parallel search with distributed populations. In Proceedings of Foundations of Genetic Algorithms (FOGA 2015), pages 3–15. ACM Press, 2015.
• [5] Thomas Bartz-Beielstein, Christian Lasarczyk, and Mike Preuß. The sequential parameter optimization toolbox. In Experimental methods for the analysis of optimization algorithms, pages 337–362. Springer, 2010.
• [6] Thomas Bartz-Beielstein, Christian WG Lasarczyk, and Mike Preuß. Sequential parameter optimization. In Evolutionary Computation, 2005. The 2005 IEEE Congress on, volume 1, pages 773–780. IEEE, 2005.
• [7] Mauro Birattari, Thomas Stützle, Luis Paquete, and Klaus Varrentrapp. A racing algorithm for configuring metaheuristics. In Proceedings of the 4th Annual Conference on Genetic and Evolutionary Computation, pages 11–18. Morgan Kaufmann Publishers Inc., 2002.
• [8] Edmund K. Burke, Michel Gendreau, Matthew Hyde, Graham Kendall, Gabriela Ochoa, Ender Özcan, and Rong Qu. Hyper-heuristics: A survey of the state of the art. Journal of the Operational Research Society, 64(12):1695–1724, 2013.
• [9] B. Doerr, C. Doerr, and J. Yang. Optimal Parameter Choices via Precise Black-Box Analysis. July 2018.
• [10] Benjamin Doerr. Analyzing randomized search heuristics: Tools from probability theory. In Theory of Randomized Search Heuristics: Foundations and Recent Developments, pages 1–20. World Scientific, 2011.
• [11] Benjamin Doerr, Carola Doerr, and Timo Kötzing. The right mutation strength for multi-valued decision variables. In Proceedings of the Genetic and Evolutionary Computation Conference 2016 (GECCO ’16), pages 1115–1122. ACM, 2016.
• [12] Benjamin Doerr, Carola Doerr, and Jing Yang. k-bit mutation with self-adjusting k outperforms standard bit mutation. In Proc. of the International Conference on Parallel Problem Solving from Nature, LNCS 9921, PPSN ’16, pages 824–834. Springer International Publishing, 2016.
• [13] Benjamin Doerr, Carola Doerr, and Jing Yang. Optimal parameter choices via precise black-box analysis. In Proceedings of the Genetic and Evolutionary Computation Conference 2016 (GECCO ’16), pages 1123–1130, New York, NY, USA, 2016. ACM.
• [14] Benjamin Doerr, Carola Doerr, and Jing Yang. Optimal parameter choices via precise black-box analysis. arXiv preprint arXiv:1807.03403, 2018.
• [15] Benjamin Doerr, Thomas Jansen, Carsten Witt, and Christine Zarges. A method to derive fixed budget results from expected optimisation times. In Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation (GECCO ’13), pages 1581–1588. ACM, 2013.
• [16] Benjamin Doerr, Andrei Lissovoi, Pietro S. Oliveto, and John Alasdair Warwicker. On the runtime analysis of selection hyper-heuristics with adaptive learning periods. In Proceedings of the Genetic and Evolutionary Computation Conference 2018 (GECCO ’18). ACM, 2018.
• [17] Stefan Droste, Thomas Jansen, Karsten Tinnefeld, and Ingo Wegener. A new framework for the valuation of algorithms for black-box optimization. In Proceedings of Foundations of Genetic Algorithms III (FOGA 2002), pages 253–270, 2002.
• [18] Stefan Droste, Thomas Jansen, and Ingo Wegener. On the analysis of the (1+ 1) evolutionary algorithm. Theoretical Computer Science, 276(1-2):51–81, 2002.
• [19] Aguston Eiben, Robert Hinterding, and Zbigniew Michalewicz. Parameter control in evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 3(2):124–141, 1999.
• [20] William Feller. An Introduction to Probability Theory and Its Applications, volume 1. Wiley, 3rd edition, 1968.
• [21] William Feller. An Introduction to Probability Theory and Its Applications, volume 2. Wiley, 2nd edition, 1971.
• [22] Alex S. Fukunaga. Automated discovery of local search heuristics for satisfiability testing. Evolutionary Computation, 16(1):31–61, 2008.
• [23] Jonathan Gratch and Gerald DeJong. An analysis of learning to plan as a search problem. In Machine Learning Proceedings 1992, pages 179–188. Elsevier, 1992.
• [24] John J. Grefenstette. Optimization of control parameters for genetic algorithms. IEEE Transactions on systems, man, and cybernetics, 16(1):122–128, 1986.
• [25] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. ParamILS: an automatic algorithm configuration framework. Journal of Artificial Intelligence Research, 36(1):267–306, 2009.
• [26] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pages 507–523. Springer, 2011.
• [27] Thomas Jansen and Christine Zarges. Fixed budget computations: A different perspective on run time analysis. In Proceedings of the 14th annual conference on Genetic and evolutionary computation, pages 1325–1332. ACM, 2012.
• [28] Thomas Jansen and Christine Zarges. Performance analysis of randomised search heuristics operating with a fixed budget. Theoretical Computer Science, 545:39–58, 2014.
• [29] Ashiqur R. KhudaBukhsh, Lin Xu, Holger H. Hoos, and Kevin Leyton-Brown. SATenstein: Automatically building local search SAT solvers from components. Artificial Intelligence, 232:20–42, 2016.
• [30] Per Kristian Lehre and Ender Özcan. A runtime analysis of simple hyper-heuristics: To mix or not to mix operators. In Foundations of Genetic Algorithms, FOGA ‘13, pages 97–104, New York, NY, USA, 2013. ACM.
• [31] Per Kristian Lehre and Dirk Sudholt. Parallel black-box complexity with tail bounds. arXiv preprint arXiv:1902.00107, 2019.
• [32] Per Kristian Lehre and Carsten Witt. Black-box search by unbiased variation. Algorithmica, 64(4):623–642, 2012.
• [33] Johannes Lengler and Nicholas Spooner. Fixed budget performance of the (1+1) EA on linear functions. In Proceedings of the 2015 ACM Conference on Foundations of Genetic Algorithms XIII, pages 52–61. ACM, 2015.
• [34] Andrei Lissovoi, Pietro S. Oliveto, and John Alasdair Warwicker. On the runtime analysis of generalised selection hyper-heuristics for pseudo-boolean optimisation. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 849–856. ACM, 2017.
• [35] Andrei Lissovoi, Pietro S. Oliveto, and John Alasdair Warwicker. Simple hyper-heuristics optimise leadingones in the best runtime achievable using randomised local search low-level heuristics. arXiv preprint arXiv:1801.07546, 2018.
• [36] Andrei Lissovoi, Pietro S. Oliveto, and John Alasdair Warwicker. On the time complexity of algorithm selection hyper-heuristics for multimodal optimisation. AAAI ‘19, 2019. To appear.
• [37] Manuel López-Ibáñez, Jérémie Dubois-Lacoste, Leslie Pérez Cáceres, Mauro Birattari, and Thomas Stützle. The irace package: Iterated racing for automatic algorithm configuration. Operations Research Perspectives, 3:43–58, 2016.
• [38] Oded Maron and Andrew W. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Advances in neural information processing systems, pages 59–66, 1994.
• [39] Steven Minton. Integrating heuristics for constraint satisfaction problems: A case study. In AAAI 1993, pages 120–126, 1993.
• [40] Samadhi Nallaperuma, Frank Neumann, and Dirk Sudholt. Expected fitness gains of randomized search heuristics for the traveling salesperson problem. Evolutionary Computation, 25(4):673–705, 2017. PMID: 27893278.
• [41] Pietro S. Oliveto and Carsten Witt. Simplified drift analysis for proving lower bounds in evolutionary computation. Algorithmica, 59(3):369–386, 2011.
• [42] Pietro S. Oliveto and Carsten Witt. Erratum: Simplified drift analysis for proving lower bounds in evolutionary computation. arXiv preprint arXiv:1211.7184, 2012.
• [43] Chao Qian, Ke Tang, and Zhi-Hua Zhou. Selection hyper-heuristics can provably be helpful in evolutionary multi-objective optimization. In Proceedings of the International Conference on Parallel Problem Solving from Nature, PPSN ’16, pages 835–846. Springer, 2016.
• [44] Maria-Cristina Riff and Elizabeth Montero. A new algorithm for reducing metaheuristic design effort. In Evolutionary Computation (CEC), 2013 IEEE Congress on, pages 3283–3290. IEEE, 2013.
• [45] Jonathan E. Rowe and Dirk Sudholt. The choice of the offspring population size in the (1,λ) evolutionary algorithm. Theoretical Computer Science, 545:20–38, 2014.
• [46] Christian Scheideler. Probabilistic Methods for Coordination Problems. HNI-Verlagsschriftenreihe 78, University of Paderborn, 2000. Habilitation Thesis, available at http://www14.in.tum.de/personen/scheideler/index.html.en.
• [47] Thomas Stützle and Manuel López-Ibáñez. Automated Design of Metaheuristic Algorithms, pages 541–579. Springer International Publishing, 2019.

## Appendix A Proofs Omitted from the Main Part

This appendix contains proofs omitted from the main part of the paper due to space restrictions.

### A.1. Proof of Lemma 3.5

###### Proof.

Using the notation from Lemma 3.4, we have and , which implies since . Further, , and . This implies . Using and , we obtain