Theory of Parameter Control for Discrete Black-Box Optimization: Provable Performance Gains Through Dynamic Parameter Choices

04/16/2018 · by Benjamin Doerr, et al.

Parameter control aims at realizing performance gains through a dynamic choice of the parameters which determine the behavior of the underlying optimization algorithm. In the context of evolutionary algorithms this research line has for a long time been dominated by empirical approaches. With the significant advances in running time analysis achieved in the last ten years, the parameter control question has become accessible to theoretical investigations. A number of running time results for a broad range of different parameter control mechanisms have been obtained in recent years. This book chapter surveys these works and puts them into context by proposing an updated classification scheme for parameter control.


1 Introduction

Evolutionary algorithms and many other iterative black-box optimization heuristics are parametrized algorithms; i.e., their search behavior depends (to a large extent) on a set of parameters which the user needs to specify, or which are set by the algorithm designer to some default values. It is today well understood that the parameter choice can have a very decisive influence on the performance of the heuristic [LLM07]. Understanding how to best choose the parameters is therefore an important task. It is referred to as the parameter setting problem.

The parameter setting problem is difficult for several reasons.

  • Complexity of Performance Prediction. Despite significant research efforts devoted to this problem, predicting how the performance of an algorithm depends on the chosen parameter values remains a very challenging problem—both with empirical and theoretical methods. In fact, determining optimal parameter values can be very complex already for a single parameter. Many black-box optimization heuristics, however, rely on two or more parameters. Rigorously analyzing the interdependency between these parameters is often infeasible by state-of-the-art technology.

  • Problem- and Instance-Dependence. It is well known that no globally good parameter values exist, but that suitable parameter values can differ substantially between different optimization problems, and even between different instances of the same problem.

  • State-Dependence. It is furthermore widely acknowledged that the best parameter values can change during the optimization process. For example, it is often beneficial to use larger mutation rates in the beginning of an optimization process, to allow for a faster exploration, and to shrink the search radius over time, to allow for a better exploitation in the later stages, cf. Section 2 for a detailed example.

To overcome these difficulties, a large number of different parameter setting techniques have been developed. Following standard terminology in evolutionary computation, they can be classified into two main approaches:

  • Static Parameter Settings: Parameter Tuning. Parameter tuning aims at identifying parameter values that are, for a given algorithm on a given problem (instance), globally suitable throughout the whole optimization process. The parameters are initialized with these values and do not change during the optimization process. Parameter tuning thus addresses the above-mentioned problem- and instance-dependence of optimal parameter choices, but not their state-dependence.

    In empirical works, parameter tuning often requires an initial set of experiments that support an informed decision. Automated tools that help the user to identify reasonable static parameter values are available, and have been shown to bring significant performance gains over a manual tuning process. [LDC16, BBFKK10, AMS15, HHLBS09, HHLB11] are examples for automated parameter tuning approaches that have been used in (and to some extent specifically designed for) evolutionary optimization contexts.

    In theoretical works, parameter tuning requires running time bounds that depend on the parameters under investigation. The minimization of these performance bounds then suggests suitable parameter values. A prime example for such a mathematical approach towards parameter tuning is the precise running time bound for the (1+1) EA with mutation rate c/n on linear functions. Witt [Wit13] has shown that this expected optimization time is (1 ± o(1)) (e^c / c) n ln(n), an expression that is minimized for c = 1; see the small numerical sketch following this list. This bound, together with larger running time bounds for mutation rates of other asymptotic orders, proves that the often recommended choice p = 1/n is indeed optimal for the (1+1) EA on this problem. Such precise upper and lower bounds, however, are rare. Even worse, only few running time bounds that depend on two or more parameters exist, cf. Section 4.3.

  • Dynamic Parameter Settings: Parameter Control. Parameter control, in contrast, aims to benefit from a non-static choice of the parameters, the underlying idea being that this flexibility can be used to adjust the algorithm's behavior to the current state of the optimization process. Put differently, parameter control does not only aim at identifying parameter values that are a good compromise for the whole optimization process, but also at tracking the evolution of the best parameter values. Even when the optimal parameter values are rather stable, the role of parameter control is to identify these values on the fly, without a dedicated tuning step that precedes the actual optimization process.
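The minimization step of the tuning argument mentioned above is easy to reproduce numerically. The following short Python sketch evaluates the leading constant e^c/c of Witt's bound on a simple grid of mutation-rate parameters c and reports the minimizer; the grid and the printing are illustrative choices, not part of the cited analysis.

import math

# Leading constant of the (1 +/- o(1)) * (e^c / c) * n * ln(n) bound for the
# (1+1) EA with mutation rate c/n on linear functions.
def leading_constant(c: float) -> float:
    return math.exp(c) / c

# Simple grid search over candidate values of c; this only serves to confirm
# numerically that c = 1 minimizes e^c / c.
candidates = [i / 1000 for i in range(100, 3001)]
best_c = min(candidates, key=leading_constant)
print(f"best c ~ {best_c:.3f}, e^c/c ~ {leading_constant(best_c):.4f}")
# Expected output: best c ~ 1.000, e^c/c ~ 2.7183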

This book chapter focuses on non-static parameter choices, and thus on parameter control mechanisms. We survey existing theoretical works on parameter control in the context of evolutionary algorithms and other standard black-box optimization heuristics. We also summarize a few standard techniques used in the empirical research literature. (Readers interested in empirical works on parameter control are referred to [KHE15] for an exhaustive survey. Additional pointers can be found in the systematic literature review [AM16], the book chapter [EMSS07] (and other book chapters in the same collection), and the seminal paper [EHM99].) We structure our presentation by a new classification scheme for parameter control mechanisms. This taxonomy builds on the well-known classification by Eiben, Hinterding, and Michalewicz [EHM99], but modifies it to better reflect the developments that parameter control has witnessed in the last 20 years.

This book chapter is structured as follows. We motivate the use of non-static parameter choices in Section 2 by demonstrating a simple example where adaptive parameter selection is provably beneficial. We then introduce our revised classification scheme in Section 3. In the subsequent Sections 4 to 8 we survey existing theoretical results. In Section 9 we conclude this book chapter with a discussion of promising avenues for future work. A summary of selected theoretical running time results covered in this book chapter can be found in Table 2.

2 A Motivating Example: (1+1) EA and RLS on LeadingOnes

We start this section with an example that demonstrates potential advantages of parameter control mechanisms. To this end, we study the well-known LeadingOnes benchmark, the problem of maximizing an unknown function of the type

Lo_{z,σ}: {0,1}^n → [0..n], x ↦ max{ i ∈ [0..n] | x_{σ(j)} = z_{σ(j)} for all j ≤ i },

where z ∈ {0,1}^n and σ is a permutation (one-to-one map) of the set [n] := {1, ..., n}. Optimizing Lo_{z,σ} corresponds to identifying z, the unique optimum of Lo_{z,σ}. Note that for every x ∈ {0,1}^n the function value Lo_{z,σ}(x) is the length of the longest prefix that x and z have in common, when comparing the strings in the order prescribed by σ.

It has been shown in [BDN10] that the (1+1) EA with static mutation rate p needs (1/(2p²)) ((1−p)^{1−n} − (1−p)) iterations, on average, to optimize a LeadingOnes instance. This term is minimized for p ≈ 1.5936/n, which yields an expected optimization time of around 0.772 n². It was also observed in [BDN10] that a fitness-dependent choice of the mutation rate gives a better optimization time. More precisely, when x denotes the current-best individual, and we choose in the next iteration as mutation rate p = 1/(Lo(x) + 1), then the expected optimization time decreases to around 0.68 n². This is almost 21% better than the expected optimization time of approximately 0.86 n² of the (1+1) EA with standard mutation rate 1/n and about 12% better than the mentioned expected running time which the best static mutation rate achieves.

Also Randomized Local Search (RLS), the algorithm which flips in each iteration one uniformly selected bit and selects the better of parent and offspring as parent individual for the next iteration, can profit from a non-static choice of the step size, i.e., the number of bits that it flips in every iteration. It is well known that RLS needs approximately n²/2 iterations, in expectation, to optimize an n-dimensional LeadingOnes instance. In Figure 1 we take a closer look at the optimization process, and plot the expected number of iterations (y-axis) needed by RLS to identify, on an n-dimensional LeadingOnes problem, a solution of fitness value at least i (x-axis). This is the blue straight line. We also illustrate in the same figure the corresponding expected fixed-target running times of the RLS variants which flip in each iteration exactly 2 and 3 pairwise different bits, respectively. These are the yellow and gray curves. The lower-most, black line illustrates the expected performance of the RLS variant which chooses in each iteration the best of these three parameter values. We observe that this adaptive variant has an expected optimization time that is around 20% smaller than that of standard 1-bit flip RLS. We also see that for Lo-values smaller than roughly n/2 it is advisable to flip more than one bit per iteration, while 1-bit flips are optimal once a solution of Lo-value around n/2 or larger has been identified. This can be best seen by comparing the slopes of the curves in this plot of fixed-target running times. The ultimate goal of parameter control is the design of mechanisms that detect such transitions and suggest best possible parameter values for the different stages in an automated way.

Figure 1: Expected fixed-target running times of RLS variants flipping in each iteration exactly 1, 2, 3, or an adaptive number of bits. The adaptive variant, which chooses the best among the three parameter values, has a total expected optimization time that is about 20% better than RLS, which always flips one bit per iteration
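The fixed-target comparison can be reproduced in a few lines of Python. The sketch below is an illustration only: it implements RLS with an arbitrary (possibly fitness-dependent) step size, and the adaptive rule used here (3 bits early, 2 bits in the middle, 1 bit once Lo(x) ≥ n/2) is a simple stand-in for the per-stage optimal choice behind Figure 1, not the exact experimental setup of the chapter.

import random

def leading_ones(x):
    """LeadingOnes value for target z = (1, ..., 1) and the identity permutation."""
    lo = 0
    for bit in x:
        if bit != 1:
            break
        lo += 1
    return lo

def rls_fixed_target(n, step_rule, seed=0):
    """Run RLS with a (possibly fitness-dependent) step size and return a list
    first_hit[i] = number of iterations until a solution of Lo-value >= i was found."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    first_hit, evals = [0] * (n + 1), 0
    best = leading_ones(x)
    while best < n:
        k = step_rule(best, n)
        y = x[:]
        for pos in rng.sample(range(n), k):   # flip k pairwise different bits
            y[pos] = 1 - y[pos]
        evals += 1
        if leading_ones(y) >= best:
            x = y
            new_best = leading_ones(x)
            for v in range(best + 1, new_best + 1):
                first_hit[v] = evals
            best = new_best
    return first_hit

one_bit = lambda lo, n: 1
# Illustrative adaptive rule (an assumption, not the oracle rule from Figure 1):
# flip 3 bits early, 2 bits in the middle, and 1 bit once Lo(x) >= n/2.
adaptive = lambda lo, n: 3 if lo < n // 4 else (2 if lo < n // 2 else 1)

if __name__ == "__main__":
    n = 200
    print("1-bit   :", rls_fixed_target(n, one_bit)[n])
    print("adaptive:", rls_fixed_target(n, adaptive)[n])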

We note that in the example discussed in this section, “only” constant factors could be gained through the dynamic parameter choice, but that in general also asymptotic performance gains can be expected. An example where such a gain has been rigorously proven will be discussed in Section 4.3.

3 Classification of Parameter Control Mechanisms

A considerable obstacle to overcome when searching for previous works on non-static parameter choices is the lack of a commonly agreed-upon terminology. This has led to a situation in which similar techniques have significantly different names and, conversely, the same term is used for two fundamentally different concepts. Since 1999 a widely accepted classification scheme for parameter setting has been the taxonomy proposed by Eiben, Hinterding, and Michalewicz in [EHM99]. We present this classification in Section 3.1, and modify it in Section 3.2 to cope with the developments in parameter control of the last twenty years.

3.1 The Classification Scheme of Eiben, Hinterding, and Michalewicz

Eiben, Hinterding, and Michalewicz [EHM99] distinguish three different types of parameter control, namely deterministic, self-adaptive, and adaptive parameter settings.

  • A dynamic parameter choice is called deterministic if the choice of the parameter value does not depend on the fitness landscape encountered by the algorithm. Since there is thus no feedback from the optimization process into the parameter choice, the parameter value can only depend on iteration or time counters.

    It was noted already in [EHM99] that the term “deterministic” is misleading, since a time-dependent parameter choice may still contain randomized elements; that is, the time or iteration counter determines a probability distribution from which the parameter value is sampled. As alternative names for this class of update schemes, the terms scheduled or feedback-free parameter control might be more appropriate.

  • In self-adaptive parameter choices, the parameters are encoded into the representation of the search points and are thus subject to variation operators. The hope is that the better parameter values yield better offspring and are thus more likely to survive the evolutionary process. By this, implicitly, the choice of the parameters depends on the optimization process and thus, in particular, on the fitness function.

  • Adaptive parameter choices are dynamic parameter settings in which there is an explicit dependence of the parameters on the optimization process. This large category includes structurally simple success-based update rules like those resembling the 1/5-th success rule from evolution strategies, but also learning-inspired techniques which choose the parameter values depending on statistics from the optimization process so far.

3.2 A Revised Classification Scheme

At the time of writing of [EHM99], the three different types of parameter control discussed in Section 3.1 were of similar importance. In the almost twenty years since, however, we have observed an increasing interest (and massive progress) in the subcategory of adaptive parameter control schemes, which also play a predominant role within the theoretical studies. In particular, the last years have made it quite clear that the substantial differences between, say, a simple deterministic fitness-dependent choice of a parameter value and a parameter choice via reinforcement-learning approaches justify not having both in the same category. We therefore present in the next subsection an alternative classification scheme, which takes this development into account.

  • State-Dependent Parameter Control. We classify as state-dependent parameter control those mechanisms that depend only on the current state of the search process, e.g., the current population, its fitness values, its diversity, but also a time or iteration counter. Hence this subsumes the previous “deterministic” category (containing time-dependent parameter choices) and all other parameter setting mechanisms which determine the current parameter values via a pre-specified function mapping algorithm states to parameter values, possibly in a randomized manner. All these mechanisms require the user to precisely specify how the parameter value depends on the current state and as such need a substantial understanding of the problem to be solved.

  • Success-Based Parameter Control. To overcome the usability challenges and the inflexibility of state-dependent parameter control mechanisms, several approaches to set the parameters in a self-adjusting manner have been proposed. As one important type of self-adjusting parameter control mechanisms, we classify as success-based parameter settings all those mechanisms that change the parameters from one iteration to the next. In other words, the parameter value to be used in the current iteration is determined (possibly in a randomized manner) by the parameter value used in the previous iteration and by an evaluation of how successful the previous iteration was. The success measure can be simple binary information, like whether a solution with superior fitness was found, but it could also take into account quantitative information like the fitness gain or loss in this iteration. Depending on the parameter to be set, other quantities than the fitness can be taken into account as well, e.g., the evolution of the diversity of the population.

    The most common form of success-based rules are multiplicative updates of parameters, which increase or decrease the parameter value by suitable factors depending on whether the previous iteration was classified as a success or not. Success-based rules other than multiplicative updates have been designed as well. For example, in [DGWY17] the offspring are generated with two different parameter values, and the information about which parameter value led to the best offspring determines the parameter value of the next iteration, cf. Section 5.2.3 for a detailed discussion.

  • Learning-Inspired Parameter Control. As the second main type of self-adjusting parameter control mechanisms, we classify as learning-inspired parameter control mechanisms all those schemes which aim at exploiting a longer search history than just one iteration. To allow such learning mechanisms to also adapt quickly to changing environments, older information is taken into account to a lesser extent than more recent information. This can be achieved by only regarding information from (static or sliding) time windows or by discounting the importance of older information via weights that decrease (usually exponentially) with the age of the data.

    Most learning-inspired parameter control mechanisms that have been experimented with in the evolutionary computation context borrow tools from machine learning, where a similar problem, known as the multi-armed bandit problem, is studied, cf. Section 6.2.

  • Endogenous Parameter Control (Self-adaptation). This category corresponds to the self-adaptive parameter control mechanisms in the taxonomy of [EHM99]. We prefer the name endogenous parameter control as it best emphasizes the structural difference of these mechanisms, which is to encode the parameters in the genome and to let them evolve via the usual variation and selection mechanisms of the evolutionary system.

  • Hyper-Heuristics. Hyper-heuristics are algorithms that operate on a set of low-level heuristics, select from it an algorithm, and run it for some time, before re-evaluating which of the low-level heuristics to use next. The main hope is that the hyper-heuristics automate the algorithm selection and configuration process, in a way that allows for maximizing the profit from different algorithmic ideas in the different stages of the optimization process. Similar to the motivation behind endogenous parameter control, the use of a high-level hyper-heuristic is guided by the belief that the high complexity of the parameter control problem calls for efficient heuristic approaches.

Figure 2 summarizes our classification scheme. Existing theoretical results are summarized in the next sections, which are structured according to this taxonomy.

Figure 2: Classification of Parameter Control Mechanisms. We also refer to success-based and learning-inspired mechanisms as self-adjusting.

We emphasize that our classification is partly driven by the historical development of the field. For example, it would be more logical to not treat hyper-heuristics (as long as they essentially optimize parameters) as a separate category, but rather to classify them as success-based or learning-inspired parameter control schemes. Since historically the area of hyper-heuristics developed relatively independently (partially due to the fact that there are many hyper-heuristics that cannot be seen as parameter control mechanisms), we prefer to maintain a separate category for hyper-heuristics.

4 State-Dependent Parameter Control

We recall from the previous section that state-dependent parameter selection schemes are those mechanisms which choose the parameter values based only on the current state of the algorithm, without making use of the search history. One of the best known examples for state-dependent parameter control is the so-called cooling schedule used by Simulated Annealing. The idea of this cooling schedule is to start the heuristic with a rather generous acceptance behavior, and to increase the selective pressure during the optimization process, cf. Section 4.1 for a more detailed description. The cooling schedule, as the name suggests, is a time-dependent selection mechanism, which maps the iteration counter to a temperature value that defines the selective pressure.

As we shall see in this section, time-dependent parameter selection schemes have also been experimented with in the context of evolutionary computation. In addition, other state-dependent parameter settings, like rank- and fitness-based mutation rates and diversity-based parameter choices have been analyzed empirically, but have received considerably less attention in the theory of evolutionary algorithms community.

4.1 Time-Dependent Parameter Choices

Simulated Annealing is typically not regarded as an evolutionary algorithm, since it draws inspiration from the physical phenomenon of an annealing process. We nevertheless decided to discuss it in this book chapter, as it is structurally very similar to Randomized Local Search, and certainly falls in the class of iterative randomized black-box optimization heuristics.

Simulated Annealing [KGV83] is a (1+1)-type search heuristic that uses a Boltzmann selection rule to decide whether or not to replace the previous parent individual x by a new solution y. More precisely, the algorithm keeps in its memory only one previously evaluated solution x, and modifies it by a local variation. In case of pseudo-Boolean maximization this local move is identical to that of RLS, i.e., the offspring y is created from x by flipping exactly one bit, the position of which is chosen uniformly at random. The new solution y always replaces x if it is better, and it replaces x with probability exp(−(f(x) − f(y))/T) otherwise, where T denotes the current temperature. That is, the better y, the larger the probability that it survives the selection procedure. The novelty of Simulated Annealing over its predecessor, the Metropolis algorithm [MRR53], is an adaptive choice of the temperature T in the Boltzmann selection rule: while the Metropolis algorithm uses the same T throughout the whole optimization process, the value of T is decreased over time in Simulated Annealing, either with each iteration or, more commonly, after a fixed number of iterations. The adaptive selective pressure results in a more generous acceptance behavior at the beginning of the optimization process (to allow for faster exploration), and a more and more elitist selection towards the end (“exploitation”). For a constant temperature T, Simulated Annealing degenerates into the Metropolis algorithm. Numerous successful applications and more than 43,000 citations of [KGV83] (according to Google Scholar as of April 12, 2018) witness that this idea to control the selective pressure during the optimization process can have an impressive impact on the performance.

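A minimal Python sketch of Simulated Annealing for pseudo-Boolean maximization, following the description above; the geometric cooling schedule and the concrete parameter values (initial temperature, decay factor, budget) are illustrative assumptions and not the settings analyzed in the cited works.

import math
import random

def simulated_annealing(f, n, t0=1.0, alpha=0.999, budget=100_000, seed=0):
    """Maximize a pseudo-Boolean function f: {0,1}^n -> R.
    One uniformly chosen bit is flipped per iteration; worse offspring are
    accepted with the Boltzmann probability exp(-(f(x) - f(y)) / T)."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    fx, temp = f(x), t0
    for _ in range(budget):
        y = x[:]
        y[rng.randrange(n)] ^= 1          # local move: flip exactly one bit
        fy = f(y)
        if fy >= fx or rng.random() < math.exp(-(fx - fy) / temp):
            x, fx = y, fy
        temp *= alpha                     # multiplicative cooling schedule
    return x, fx

if __name__ == "__main__":
    onemax = lambda x: sum(x)
    print(simulated_annealing(onemax, n=100)[1])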

A number of theoretical results analyzing the performance of Simulated Annealing exist. Most of these prove convergence to a global optimum for suitably chosen parameter settings, cf. the book chapter [HJJ03] for a summary of selected theoretical and empirical results. In addition to the results mentioned there, a plethora of running time results exist for combinatorial optimization problems on graphs, including most notably matching [SH88] and graph bisection problems [CI01, Imp01, JS93]. Selected theoretical works that concentrate on the advantages of dynamic parameter choices are summarized below.

Answering an open problem posed in [JS97], Wegener presented in [Weg05] a problem class for which Simulated Annealing outperforms its static counterpart, the Metropolis algorithm, regardless of how the temperature value is chosen in the latter. More precisely, Wegener proves that Simulated Annealing with multiplicative temperature decay (the temperature is multiplied by a constant factor smaller than one in each iteration, with an initial value that is ignorant of the instance, but may depend on the number of edges and the maximal edge weight) has a better expected optimization time on some subclasses of the Minimum Spanning Tree (MST) problem than the Metropolis algorithm with any fixed temperature. Previous examples for this phenomenon had been presented in [Sor91] and [DJW00], but were of a rather artificial nature. The novelty of [Weg05] was thus to prove this statement for a natural combinatorial optimization problem. A particular instance of the MST problem for which Wegener proved the superiority of Simulated Annealing is a graph that has the form of connected triangles. Wegener also showed a provable advantage for ε-separated graphs, in which any two non-equal weights differ by a constant factor of at least 1 + ε, cf. [Weg05, Section 5].

One of the first works analyzing a classic evolutionary algorithm with a dynamic parameter setting was presented by Droste, Jansen, and Wegener in the above-mentioned work [DJW00]. Besides a time-dependent selection strategy, the authors also analyze the (1+1) EA with a time-dependent mutation rate. In this algorithm, the mutation rate is initialized as 1/n and doubled in every iteration until it exceeds 1/2, in which case it is reset to 1/n. An example function, PathToJump, is presented for which the (1+1) EA with the time-dependent mutation rate needs only a polynomial number of steps, on average, to locate the optimum, while the (1+1) EA with static mutation rate 1/n does not optimize PathToJump in expected polynomial time. The authors also show a converse result in which the dynamic (1+1) EA is much slower than the classical static one. It is not difficult to see that the dynamic EA performs worse than the static (1+1) EA on most classic benchmark functions like OneMax, LeadingOnes, etc., cf. [DJW00, Section 4]. This work was later extended and simplified by Jansen and Wegener in [JW06].
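The cyclic doubling schedule is easy to state in code. The sketch below shows it inside a (1+1) EA loop; the OneMax objective and the budget are placeholders for illustration only.

import random

def one_plus_one_ea_dynamic(f, n, budget=50_000, seed=0):
    """(1+1) EA with the time-dependent mutation rate of Droste, Jansen, and
    Wegener: the rate starts at 1/n, doubles every iteration, and is reset to
    1/n as soon as it exceeds 1/2."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    fx, p = f(x), 1.0 / n
    for _ in range(budget):
        y = [b ^ (rng.random() < p) for b in x]   # standard bit mutation with rate p
        fy = f(y)
        if fy >= fx:
            x, fx = y, fy
        p *= 2
        if p > 0.5:
            p = 1.0 / n
    return fx

if __name__ == "__main__":
    print(one_plus_one_ea_dynamic(sum, n=100))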

In [JW07] a comparison is made between the (1+1) EA with static and with time-dependent mutation rates on the one hand, and Simulated Annealing and the Metropolis algorithm on the other hand; the focus of this work is not on the advantages of adaptive parameter choices, but rather on a comparison of the different selection schemes.

4.2 Rank-Dependent Parameter Control

Motivated by empirical work reported in [CS09], Oliveto, Lehre, and Neumann analyzed in [OLN09] a (μ+1) EA with rank-based mutation rates. In this algorithm, the individuals of the parent population are ranked according to their fitness values and the mutation rate applied in some iteration depends on the rank of the (uniformly selected) individual undergoing mutation. The intuition behind these rank-based mutation rates is that individuals at larger ranks (i.e., worse fitness) should be modified more aggressively (suggesting large mutation rates), while the best individuals of the population should be modified with caution, suggesting small mutation rates.

To be more precise, the algorithm proposed in [CS09] uses standard bit mutation with a rank-dependent mutation rate: for the i-th ranked search point, the rate is obtained by linear interpolation between a prescribed minimal and a prescribed maximal mutation rate, so that better-ranked individuals receive smaller rates. The variant studied in [OLN09] fixes concrete values for the population size and for these two extreme mutation rates. Theorem 1 below gives a general upper bound for the rank-based (μ+1) EA, which is better than the expected running time of the classical (μ+1) EA on functions like Needle or Trap.

Theorem 1 (Theorems 1 and 2 in [OLN09]).

For suitable choices of the population size μ and of the minimal and maximal mutation rates, the (μ+1) EA with rank-based mutation rates admits a general upper bound on its expected optimization time that holds for every pseudo-Boolean function f, and a much smaller bound holds for OneMax. (The general bound is stated with a typo in [OLN09, Theorem 1], but the proof clearly shows the corrected upper bound.) We recall that OneMax is the function that assigns to each x ∈ {0,1}^n the number of ones in it, i.e., OM(x) = x_1 + ... + x_n. All running time bounds that we state in this chapter for the optimization of OneMax also apply to the optimization of the functions x ↦ |{i ∈ [n] | x_i = z_i}|, z ∈ {0,1}^n, whose fitness landscape is isomorphic to that of OneMax.

In addition to these results, examples are constructed for which the (μ+1) EA with rank-based mutation rates performs significantly worse [OLN09, Section V] and significantly better [OLN09, Section VI] than the classical (μ+1) EA with standard bit mutation rate 1/n.

4.3 Fitness-Dependent Parameter Control

While rank-based parameter selection had originally been introduced with the hope to find a generally well-functioning control scheme, fitness-based parameter selection schemes are often highly problem-tailored, and cannot be assumed to work particularly well when applied to different objective functions. The theoretical results stated below should therefore not be considered as a suggestion for generally applicable parameter control mechanisms, but rather as a point of comparison for more plausible, general-purpose parameter update techniques; i.e., we should use these results only as a lower bound for the performance of a best possible parameter update scheme. This way, the results form a baseline that helps us understand and judge the limits of parameter control.

4.3.1 Fitness-Dependent Mutation Rates for the (1+1) EA on LeadingOnes

The first work showing a significant advantage of a fitness-dependent choice of the mutation rate has been presented in [BDN10], where the following result is shown. (Prior to [BDN10], fitness-dependent mutation rates had also been analyzed in immune algorithms [Zar09, Zar08], but no advantage of the analyzed parameter choices could be shown.)

Theorem 2 (Theorems 3 to 6 in [BDN10]).

On LeadingOnes, the expected number of iterations needed by the (1+1) EA with static mutation rate p to identify the optimal solution is (1/(2p²)) ((1−p)^{1−n} − (1−p)). This expression is minimized for p ≈ 1.5936/n, which gives an expected optimization time of around 0.772 n².

For the (1+1) EA variant that chooses in every iteration the fitness-dependent mutation rate p = 1/(Lo(x) + 1), where x denotes the solution that undergoes modification, the expected optimization time decreases to around 0.68 n². No other fitness-dependent mutation rate can achieve a better expected optimization time.

In this result the expected optimization time of the fitness-dependent (1+1) EA is almost 21% better than the expected optimization time of approximately 0.86 n² of the (1+1) EA with standard mutation rate 1/n and about 12% better than the expected running time which the best static mutation rate achieves.
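The following small Python sketch contrasts the static and the fitness-dependent mutation rate on LeadingOnes; the problem size and the single-run comparison are illustrative only (the theorem speaks about expectations, not single runs).

import random

def leading_ones(x):
    lo = 0
    for bit in x:
        if bit != 1:
            break
        lo += 1
    return lo

def one_plus_one_ea(n, rate_for, seed=0):
    """(1+1) EA on LeadingOnes; rate_for(lo, n) returns the mutation rate to use."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    lo, evals = leading_ones(x), 0
    while lo < n:
        p = rate_for(lo, n)
        y = [b ^ (rng.random() < p) for b in x]
        evals += 1
        if leading_ones(y) >= lo:
            x, lo = y, leading_ones(y)
    return evals

static_rate = lambda lo, n: 1.0 / n
fitness_dependent_rate = lambda lo, n: 1.0 / (lo + 1)   # rate suggested in [BDN10]

if __name__ == "__main__":
    n = 300
    print("static 1/n       :", one_plus_one_ea(n, static_rate))
    print("fitness-dependent:", one_plus_one_ea(n, fitness_dependent_rate))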

4.3.2 Fitness-Dependent Mutation Rates for the (1+λ) EA on OneMax

Interestingly, the question how to best control the mutation rate during the optimization process gained relevance with the establishment of black-box complexity as a measure for the best possible running time that any randomized search heuristic of a certain type can achieve (cf. [Doe18] for a survey of works on this complexity notion). By comparing existing algorithms with the theoretically best possible performance, one can judge how well suited a given approach is. Non-surprisingly, the best-possible algorithms take into account the state of the optimization process, and adjust their parameters accordingly.

In this context, and more precisely, in the context of analyzing lower bounds for the performance of unbiased parallel evolutionary algorithms, Badkobeh, Lehre, and Sudholt analyzed in [BLS14] the optimal fitness-dependent mutation rate for the (1+λ) EA on OneMax. The main result is summarized by the following theorem.

Theorem 3 (Theorems 3 and 4 in [BLS14]).

The (1+λ) EA that uses in each iteration the mutation rate p = max{ (ln λ)/(n ln(en/(n − f(x)))), 1/n } (where x denotes the parent individual held in the memory at the beginning of the iteration) has an expected optimization time on OneMax equal to Θ(nλ/ln λ + n log n).

This performance is best possible among all λ-parallel unary unbiased black-box algorithms, i.e., among all unary unbiased algorithms that create λ offspring in parallel in each iteration.

The performance of this fitness-dependent (1+λ) EA is, for many values of λ, superior to the performance of the (1+λ) EA with the static mutation rates regarded so far, which is Θ(nλ log log λ / log λ + n log n) for mutation rate c/n, c a constant, see [DK15, GW17], and which is larger for static mutation rates of other asymptotic orders [DGWY17, Lemma 1.2].

In Section 5.2.3 we will see an example for a purely success-based adaptation scheme which achieves the same expected performance as the (1+λ) EA with this fitness-dependent mutation rate. Most recently, a self-adaptive (1,λ) EA has been designed which also achieves the same bound. This algorithm will be discussed in Section 7.

4.3.3 Fitness-Dependent Mutation Strengths for RLS on OneMax

While the result in Section 4.3.2 is of asymptotic order only, one might hope to get more precise results for selected values of λ. Unfortunately, the precise relationship between function values and optimal mutation rates is not even known in the very special case λ = 1. What is known, however, is the following.

In [DDY16b] it is shown that the best possible expected running time on OneMax that any unary unbiased black-box algorithm can achieve is n ln(n) − cn ± o(n) for a constant c, for which [DDY16b] proves explicit numerical bounds. It cannot be smaller by more than an additive lower-order term than the expected optimization time attained by the RLS variant that chooses in every iteration the mutation strength (i.e., the number of bits to be flipped) in a way that maximizes the expected progress. By the symmetry of the OneMax function, this drift-maximizing mutation strength depends only on the fitness of the current-best solution, and not on the structure of this search point. More precisely, when k different bits of the search point x are flipped to create y, the expected progress equals

E[max{OM(y) − OM(x), 0}] = Σ_{i > k/2} (2i − k) · C(n − OM(x), i) · C(OM(x), k − i) / C(n, k),    (1)

where C(a, b) denotes the binomial coefficient “a choose b”. The drift-maximizing mutation strength is the value of k that maximizes this expression. (No easily interpretable algebraic relationship between OM(x) and this optimal k could be established in [DDY16b], and an approximation of the drift-maximizing mutation strength is therefore used in that work. It is shown, however, that this affects the overall performance by lower-order terms only.)

Theorem 4 (Theorem 9 in [DDY16b]).

The expected optimization time of the drift-maximizing algorithm with fitness-dependent mutation strengths is n ln(n) − cn ± o(n) for a constant c, for which [DDY16b] proves explicit numerical bounds. The unary unbiased black-box complexity of OneMax is smaller than this expression by at most an additive lower-order term.

Compared to RLS or the RLS variant using an optimized initialization phase presented and analyzed in [dPdLDD15], the bound in Theorem 4 is smaller by an additive term between and . For problem dimensions the advantage of the drift-maximizing algorithm over classic RLS is around .

In the language of fixed-budget computation as introduced by Jansen and Zarges in [JZ14] the drift-maximizing algorithm with a budget of at least iterations computes a solution with expected fitness distance to the optimum roughly smaller than the output proposed by RLS [DDY16b, Section 6].
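The drift-maximizing mutation strength from equation (1) can be computed exactly by brute force for moderate problem sizes. The following Python sketch does this purely for illustration; it is not the approximation scheme used in [DDY16b].

from math import comb

def expected_progress(n, ones, k):
    """Expected fitness gain of flipping k pairwise different bits of a solution
    with the given number of one-bits, under elitist selection (equation (1))."""
    total = comb(n, k)
    drift = 0.0
    for i in range(k // 2 + 1, k + 1):          # only strictly positive gains 2i - k > 0
        if i <= n - ones and k - i <= ones:
            drift += (2 * i - k) * comb(n - ones, i) * comb(ones, k - i) / total
    return drift

def drift_maximizing_k(n, ones, k_max=20):
    return max(range(1, k_max + 1), key=lambda k: expected_progress(n, ones, k))

if __name__ == "__main__":
    n = 1000
    for ones in (500, 600, 700, 800, 900, 990):
        print(ones, drift_maximizing_k(n, ones))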

4.3.4 Fitness-Dependent Offspring Population Sizes in the (1+(λ,λ)) Genetic Algorithm

All the results above concern the control of the mutation rate. A fitness-dependent choice of the offspring population size was considered in [DDE15] for the (1+(λ,λ)) GA on OneMax. Since this algorithm later gave rise to a growing interest in parameter control (note that the conference version [DDE13] appeared before most of the other results mentioned in this section), we describe this algorithm in more detail. Note in particular that, in contrast to the purely mutation-based algorithms mentioned above, the (1+(λ,λ)) GA also uses crossover.

The (1+(λ,λ)) GA works with a parent population of size one. This population is initialized with a search point chosen from {0,1}^n uniformly at random. The (1+(λ,λ)) GA then proceeds in iterations, each consisting of a mutation phase, a crossover phase, and a final elitist selection step determining the new parent population.

In the mutation phase, a step size ℓ is chosen at random from the binomial distribution Bin(n, p), where the parameter p is called the mutation rate of the algorithm. Then λ offspring are created independently by flipping exactly ℓ (i.e., pairwise different) random bits in the parent x. In an intermediate selection step, one best mutation offspring x′ is selected as mutation winner. In the crossover phase, again λ offspring are created; this time via a biased uniform crossover between x and x′, taking each entry from x′ with probability c only and taking the entry from x otherwise. Again, an intermediate selection chooses one of the best crossover offspring y as crossover winner. In the final selection step, this y replaces x if its fitness is at least as large as the fitness of x, i.e., if and only if f(y) ≥ f(x) holds.

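A compact Python sketch of the (1+(λ,λ)) GA on OneMax with a fixed integer λ and the recommended couplings p = λ/n and c = 1/λ; evaluation counting and tie handling are simplified for illustration.

import random

def one_plus_ll_ga(n, lmbda, budget=10**6, seed=0):
    """(1+(lambda,lambda)) GA on OneMax with p = lambda/n and c = 1/lambda."""
    rng = random.Random(seed)
    f = sum                                   # OneMax
    x = [rng.randint(0, 1) for _ in range(n)]
    p, c, evals = lmbda / n, 1.0 / lmbda, 0
    while f(x) < n and evals < budget:
        # Mutation phase: flip exactly ell random bits, ell ~ Bin(n, p).
        ell = sum(rng.random() < p for _ in range(n))
        mutants = []
        for _ in range(lmbda):
            y = x[:]
            for pos in rng.sample(range(n), ell):
                y[pos] = 1 - y[pos]
            mutants.append(y)
        evals += lmbda
        x_prime = max(mutants, key=f)         # mutation winner
        # Crossover phase: biased uniform crossover between x and x_prime.
        offspring = []
        for _ in range(lmbda):
            y = [xp if rng.random() < c else xb for xb, xp in zip(x, x_prime)]
            offspring.append(y)
        evals += lmbda
        winner = max(offspring, key=f)        # crossover winner
        if f(winner) >= f(x):                 # elitist final selection
            x = winner
    return f(x), evals

if __name__ == "__main__":
    print(one_plus_ll_ga(n=200, lmbda=8))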

The (1+(λ,λ)) GA thus has three parameters that need to be set prior to any execution: the offspring population size λ, the mutation rate p, and the crossover bias c. Using intuitive considerations, it was suggested in [DDE15] to use p = λ/n and c = 1/λ. With these choices, the 3-dimensional parameter space is reduced to a one-dimensional one, and only λ needs to be set. In [DDE15] it was shown that choosing λ = Θ(√(log n)) yields an expected running time of O(n √(log n)) for the (1+(λ,λ)) GA on the OneMax problem. This bound was later improved to O(n √(log(n) log log log(n) / log log(n))) in [DD15b]; this expected running time is attained for a slightly larger value of λ. Finally, [Doe16] showed that the suggested dependencies p = λ/n and c = 1/λ are asymptotically optimal in the sense that any static parameter combination achieving this expected running time has to satisfy them up to constant factors, and that no static parameter combination can achieve an asymptotically better running time than the bound from [DD15b].

The results mentioned above all concern static parameter values. In terms of dynamic parameters, it was observed already in [DDE15] that a better expected running time, namely a linear one, can be achieved by the (1+(λ,λ)) GA on OneMax if we allow the parameters to depend on the current function value. This linear expected performance has later been shown to be asymptotically optimal.

Theorem 5 (Theorem 8 in [DDE15] and Sections 5 and 6.5 in [DD18]).

The expected optimization time of the (1+(λ,λ)) GA with λ = ⌈√(n/(n − f(x)))⌉, p = λ/n, and c = 1/λ on OneMax is Θ(n), and this is asymptotically best possible among all dynamic parameter choices. For any static parameter values, the expected running time of the (1+(λ,λ)) GA on OneMax is of strictly larger than linear order.

In Section 5 we will discuss a success-based parameter control mechanism that identifies and tracks good values for λ in an automated way.

5 Success-Based Parameter Control

As success-based parameter control mechanisms we classified all those which change the parameters from one iteration to the next, based on the outcome of the iteration. This includes in particular multiplicative update rules which change parameters by constant factors depending on whether the iteration was considered a success or not.

5.1 The 1/5-th Success Rule and Other Multiplicative Success-Based Updates

Already the very early works on evolution strategies used a simple, yet powerful technique to adapt the parameters online. The so-called 1/5-th success rule, which was independently discovered in [Rec73, Dev72, SS68], suggests to set the step size of an evolution strategy in such a manner that around one fifth of the iterations lead to a fitness improvement. The idea behind this is that when the success rate is higher, then most likely the step size is too small and time is wasted on minor improvements; when the success rate is smaller, the step size is most likely too large and time is wasted by waiting too long for an improvement. The value 1/5 was derived from theoretical considerations for the performance of the (1+1) evolution strategy on the sphere function. Rechenberg showed that a success rate of about 1/5 yields optimal expected gain for this problem (and also for another problem with a so-called inclined ridge, cf. [Rec73] for details).

The first implementations of this 1/5-th success rule were not success-based in our language, but rather observed the success rate over several iterations and then adjusted the step size if a discrepancy from the target success rate of 1/5 was detected. In [KMH04], a simpler success-based implementation was proposed. Here, the step size is multiplied by a constant F > 1 in case of success and divided by F^{1/4} in case of no success; in this way the step size remains stable as long as exactly one out of five iterations is successful. The hyper-parameter F is called the update strength of the adaptation rule.
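A minimal sketch of this success-based implementation of the 1/5-th rule for a real-valued step size; the objective (the sphere function), the update strength, and the budget are illustrative choices.

import random

def one_plus_one_es(dim=10, sigma=1.0, F=1.5, budget=5000, seed=0):
    """(1+1) ES on the sphere function with the success-based 1/5-th rule:
    multiply the step size by F after a success, divide by F**0.25 otherwise."""
    rng = random.Random(seed)
    sphere = lambda x: sum(v * v for v in x)
    x = [rng.uniform(-5, 5) for _ in range(dim)]
    fx = sphere(x)
    for _ in range(budget):
        y = [v + sigma * rng.gauss(0, 1) for v in x]
        fy = sphere(y)
        if fy < fx:                       # success: accept and enlarge the step size
            x, fx = y, fy
            sigma *= F
        else:                             # failure: keep parent and shrink the step size
            sigma /= F ** 0.25
    return fx, sigma

if __name__ == "__main__":
    print(one_plus_one_es())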

We next present two examples for success-based parameter control suggested in the literature.

Example 1: the 1/5-th success rule applied to the (1+(λ,λ)) GA. It may be surprising that a simple multiplicative success-based rule can work. We therefore present an illustrated example, the self-adjusting (1+(λ,λ)) GA, which has originally been proposed in [DDE15] and later been formally analyzed on the OneMax problem in [DD15a]. We will describe this algorithm in more detail in Section 5.2.1, but note here only that, by using the recommended dependencies p = λ/n and c = 1/λ, the self-adjusting (1+(λ,λ)) GA requires the offspring population size λ as its only parameter. The value of λ is adapted based on the success of a full iteration, using the above-sketched implementation of the 1/5-th success rule suggested in [KMH04]. Figure 3 shows how well the optimal fitness-dependent value of the offspring population size suggested by Theorem 5 (smooth black curve) is approximated by this multiplicative success-based update rule (staggered red curve). The uppermost (blue) curve shows the evolution of the current-best fitness value, from which the optimal fitness-dependent value of λ is computed. Note that in this figure we show the parameter values per iteration, each of which costs 2λ function evaluations. The update strength F in this illustration is set to 1.5.

Figure 3: Application of the 1/5-th success rule to the offspring population size λ of the (1+(λ,λ)) GA on OneMax

Example 2: The (1,λ) EA with success-based offspring population size λ. A different success-based parameter control has been suggested in [HGO95] for the control of the offspring population size λ in a non-elitist (1,λ) evolution strategy (ES). Motivated by a theoretical result that proves that in the (1,λ) ES the so-called local serial progress is maximized when the expected progress of the second best offspring created in one iteration is zero (this result applies to any objective function), the following multiplicative success-based update rule for the offspring population size has been suggested. Denoting by x the parent individual of the t-th iteration, by λ^(t) the selected offspring population size, and by the offspring of the t-th iteration sorted by decreasing function values, the offspring population size for the next iteration is set to

(2)

where the adaptation factor appearing in this rule is a hyper-parameter that controls the speed of the adaptation. While this update mechanism, to the best of our knowledge, has not been formally analyzed, it is shown in [HGO95] to give good performance on the hyper-plane and the hyper-sphere problem.

5.2 Theoretical Results for Success-Based Parameter Control

In this section we describe the theoretical results known for success-based parameter control mechanisms. We note that some works on hyper-heuristics closely resemble success-based parameter control. The reader can find these in Section 8.3.

5.2.1 The Self-Adjusting (1+(λ,λ)) GA on OneMax and on MaxSAT

We have seen in Theorem 5 that the (1+(λ,λ)) GA with mutation rate p = λ/n, crossover bias c = 1/λ, and fitness-dependent population size λ = ⌈√(n/(n − f(x)))⌉ takes an expected number of Θ(n) function evaluations to optimize a OneMax instance of problem dimension n. This is the asymptotically best running time among all static and dynamic parameter choices. A substantial drawback of this result is the rather complex dependence of λ on the current-best function value f(x). The question whether this relationship can be detected by a parameter control mechanism in an automated way suggests itself. In fact, already in [DDE15] a success-based choice of λ was suggested, and shown to achieve a very similar empirical performance as the fitness-dependent choice, across all tested problem dimensions. In [DD18] the efficiency of this success-based variant of the (1+(λ,λ)) GA, which we will describe in more detail below, could be formally proven.

Theorem 6 (Theorem 9 in [DD18]).

The expected optimization time of the self-adjusting (1+(λ,λ)) GA with mutation rate p = λ/n, crossover bias c = 1/λ, and sufficiently small update strength F > 1 on OneMax is Θ(n).


The success-based choice of the parameter λ uses the above-mentioned implementation of the 1/5-th success rule considered in [KMH04]. That is, after an iteration that led to an increase of the best observed function value (“success”), λ is reduced by a constant factor F > 1. If an iteration was not successful, λ is increased by the multiplicative factor F^{1/4}. Consequently, after a series of iterations with an average success rate of 1/5, this mechanism ends up with the initial value of λ.

Since a mutation rate p = λ/n larger than one is not meaningful, the value of λ is capped at n. Likewise, it is capped from below at 1. The value of λ is allowed to be non-integral. Where an integer is required (i.e., when λ is used as the number of mutation and crossover offspring), λ is rounded to its closest integer. That is, instead of λ we regard ⌊λ⌋ if the fractional part of λ is smaller than 1/2 and we regard ⌈λ⌉ otherwise.
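The λ-update just described is easy to isolate in code. The sketch below shows it together with the capping and rounding conventions; the simulated success test is only a stand-in for a full (1+(λ,λ)) GA iteration and merely serves to make the snippet runnable.

import math
import random

def update_lambda(lmbda, success, F=1.5, n=1000):
    """One application of the 1/5-th success rule to lambda:
    shrink by F on success, grow by F**(1/4) otherwise, capped to [1, n]."""
    lmbda = lmbda / F if success else lmbda * F ** 0.25
    return min(max(lmbda, 1.0), float(n))

def as_integer(lmbda):
    """Rounding convention: round lambda to its closest integer."""
    return math.floor(lmbda) if lmbda - math.floor(lmbda) < 0.5 else math.ceil(lmbda)

if __name__ == "__main__":
    rng = random.Random(0)
    lmbda, n = 1.0, 1000
    for t in range(20):
        # Stand-in for "did this (1+(lambda,lambda)) GA iteration improve the best fitness?"
        success = rng.random() < 0.2
        lmbda = update_lambda(lmbda, success, n=n)
        print(t, as_integer(lmbda), round(lmbda, 3))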

In the experiments conducted in [DDE13], see in particular Figure 4 there, all tested update strengths worked well. While this indicates some robustness of the result in Theorem 6 with respect to the choice of F, it has been argued in [DD18, Section 6.4] that too large update strengths may lead to an exponential expected optimization time on OneMax. A commonly used value for F, also used in Auger’s implementation [Aug09], is F = 1.5. This is also the value with which Figure 3 has been created.

One may further wonder how important the relationship between the two multiplicative updates, that is, the exponent 1/4 in the increase factor, really is. It is argued in [DD18, Section 6.4] that a result similar to Theorem 6 is likely to hold for a range of other exponents as long as the exponent is not too large. Hence in discrete optimization, there is no particular reason for insisting on a 1/5-th rule. This has also been observed in a recent work on image composition, where a success-based rule with a different target success ratio was used to adjust the length of a random walk that is part of the mutation operator [NSCN17]. In a set of initial experiments a particular target ratio turned out to be suitable and is used for the empirical evaluations in that work.

Being the first algorithm for which a success-based parameter control mechanism could be proven to reduce the expected optimization time, the self-adjusting (1+(λ,λ)) GA has been analyzed also on other functions, by empirical and theoretical means. Already in [DDE15, Section 4] a promising empirical performance for linear functions with random weights and for so-called royal road functions was reported. In [GP14] the self-adjusting (1+(λ,λ)) GA is tested on a number of combinatorial problems. In particular for the maximum satisfiability problem, the self-adjusting (1+(λ,λ)) GA shows a very good performance, beaten only by the parameterless population pyramid proposed in the same work. Inspired by this result, a mathematical running time analysis of the (1+(λ,λ)) GA on random satisfiability instances was conducted in [BD17]. It confirms that the (1+(λ,λ)) GA has a better performance than solely mutation-based algorithms, see, e.g., [DNS17]. The work however also shows that the weaker fitness-distance correlation of the satisfiability instances can lead to the effect that, when offspring are created with a high mutation rate, the algorithm has problems determining the structurally better ones. This difficulty can be overcome by imposing an upper limit on the population size λ, which also bounds the mutation rate p = λ/n.

5.2.2 The (1+λ) EA with Success-Based Offspring Population Size

For the (1+λ) EA, the following success-based adaptation of the offspring population size λ has been suggested in [JDJW05, Section 5]. The offspring population size is initialized as one. After each iteration, the number s of offspring having a function value that is at least as large as the parent fitness is determined. When s = 0 (i.e., if the iteration has been unsuccessful), the offspring population size is doubled, otherwise it is replaced by λ/s. The intuition for this adaptive choice of the offspring population size is to keep λ inversely proportional to the probability of creating an offspring that replaces its parent. This algorithm, which we call the (1+{2λ, λ/s}) EA, had not been analyzed by mathematical means in [JDJW05], but showed encouraging empirical performance on OneMax, LeadingOnes, and a benchmark function called SufSamp.
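A small Python sketch of this success-based λ adaptation inside a (1+λ) EA on OneMax; counting s, doubling on failure, and dividing by s on success follow the description above, while the mutation rate, the budget, and the problem size are illustrative choices.

import random

def one_plus_lambda_adaptive(n=200, budget=200_000, seed=0):
    """(1+lambda) EA whose offspring population size is doubled after an
    unsuccessful iteration and set to lambda/s if s >= 1 offspring were at
    least as good as the parent (the (1+{2*lambda, lambda/s}) EA)."""
    rng = random.Random(seed)
    f = sum                                   # OneMax
    x = [rng.randint(0, 1) for _ in range(n)]
    lmbda, evals = 1.0, 0
    while f(x) < n and evals < budget:
        k = max(1, round(lmbda))
        offspring = []
        for _ in range(k):
            y = [b ^ (rng.random() < 1.0 / n) for b in x]
            offspring.append(y)
        evals += k
        s = sum(f(y) >= f(x) for y in offspring)
        if s == 0:
            lmbda *= 2                        # unsuccessful iteration: double lambda
        else:
            lmbda = max(1.0, lmbda / s)       # successful iteration: shrink lambda
            x = max(offspring, key=f)
    return f(x), evals, round(lmbda)

if __name__ == "__main__":
    print(one_plus_lambda_adaptive())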

The idea of a success-based offspring population size was taken up in [LS11], where a theoretical analysis of two similar success-based update schemes was performed. The first update scheme, the (1+{2λ, 1}) EA, doubles λ if no strictly better search point could be identified and sets λ back to one otherwise. The second variant, the (1+{2λ, λ/2}) EA, also doubles λ if no solution of quality better than the parent is found, and halves λ otherwise. While these schemes do not result in an improved overall running time in terms of function evaluations, they are both able to achieve a significant reduction of the parallel optimization time on selected benchmark problems. That is, the average number of generations needed before an optimal solution is evaluated for the first time is smaller than that of classical sequential EAs, which do not perform any evaluations in parallel. The precise results are as follows.

Table 1: Expected sequential and parallel running times of the (1+{2λ, 1}) EA and the (1+{2λ, λ/2}) EA on selected benchmark problems (OneMax, LeadingOnes, unimodal functions with a given number of different function values, and a jump-type function) [LS11]. For the two bounds marked [*], we slightly improve the original bound via an elementary argument, cf. the proof below Theorem 7.
Theorem 7 (Theorem 7 in [LS11] and proof below for the results marked [*] in Table 1).

The sequential and parallel expected running times of the (1+{2λ, 1}) EA and the (1+{2λ, λ/2}) EA satisfy the bounds given in Table 1.

Proof.

Using the classic fitness level method, the expected parallel running time of the (1+{2λ, 1}) EA on OneMax is bounded from above in [LS11] by a sum over the fitness levels, which is then relaxed to a slightly weaker closed-form bound. A closer look reveals that, with Stirling’s formula, this sum can easily be bounded by the slightly stronger expression stated in Table 1. This improved bound immediately carries over to the bound for the jump-type function, where the expected parallel running time of the (1+{2λ, 1}) EA is bounded by the expected parallel running time on OneMax plus the time needed to “jump” from the local optimum to the global one. ∎

5.2.3 The 2-Rate (1+λ) EA with Success-Based Mutation Rates

In the previous examples, we have studied different ways to control the offspring population size. We now turn our attention to a success-based adaptation of the mutation rate in a (1+λ) EA with fixed offspring population size λ, which has been introduced and analyzed in [DGWY17]. This algorithm, the (1+λ) EA_{r/2,2r}, stores a parameter r that controls the mutation rate. This parameter is adjusted after each iteration by the following mechanism. In each iteration, the algorithm creates λ/2 offspring by standard bit mutation with mutation rate r/(2n), and it creates λ/2 offspring with mutation rate 2r/n. At the end of the iteration a random coin is flipped. With probability 1/2 the value of r is replaced randomly by either r/2 or 2r, and with the remaining probability it is set to the value with which the best offspring of the last iteration has been created. Finally, the value of r is capped at its smallest allowed value if it is smaller, and at its largest allowed value if it exceeds this value, so that all mutation rates used by the algorithm stay within a meaningful range.

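A Python sketch of the 2-rate scheme on OneMax; the update and the rate semantics follow the description above, while the initial value of r, the capping interval [2, n/4], the offspring population size, and the budget are illustrative assumptions.

import random

def two_rate_ea(n=200, lmbda=16, budget=500_000, seed=0):
    """(1+lambda) EA_{r/2,2r}: half of the offspring use rate r/(2n), the other
    half rate 2r/n; r then follows the winner with prob. 1/2 and is replaced
    randomly by r/2 or 2r otherwise, capped to [2, n/4] (an assumed range)."""
    rng = random.Random(seed)
    f = sum                                    # OneMax
    x = [rng.randint(0, 1) for _ in range(n)]
    r, evals = 2.0, 0
    while f(x) < n and evals < budget:
        best, best_rate_low = None, True
        for i in range(lmbda):
            low = i < lmbda // 2               # first half: rate r/(2n), second half: 2r/n
            p = r / (2 * n) if low else 2 * r / n
            y = [b ^ (rng.random() < p) for b in x]
            if best is None or f(y) > f(best):
                best, best_rate_low = y, low
        evals += lmbda
        if f(best) >= f(x):                    # elitist selection
            x = best
        if rng.random() < 0.5:                 # random step of the r-update
            r = r / 2 if rng.random() < 0.5 else 2 * r
        else:                                  # follow the rate of the winning offspring
            r = r / 2 if best_rate_low else 2 * r
        r = min(max(r, 2.0), n / 4)
    return f(x), evals

if __name__ == "__main__":
    print(two_rate_ea())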

Theorem 8 (Theorem 1.1 in [DGWY17]).

For offspring population sizes λ that are at most polynomial in n (and satisfy a mild lower bound), the expected optimization time of the (1+λ) EA_{r/2,2r} on OneMax is Θ(nλ/log λ + n log n).

By the result presented in Theorem 3 above, the expected running time achieved by the (1+λ) EA_{r/2,2r} is best possible among all λ-parallel unary unbiased black-box algorithms.

5.2.4 Success-Based Mutation Strengths for the Multi-Variate OneMax Problem

In [DDK16] a success-based choice of the mutation strength has been proven to be very efficient for a generalization of the OneMax problem to search spaces in which each of the n decision variables takes one of r possible values. Concretely, the authors study three different classes of generalized OneMax functions; all of them measure, in each coordinate i, how close the entry x_i is to the entry z_i of an unknown target string z, and the classes differ in the distance measure applied in each coordinate. Unlike all other settings regarded in this chapter, [DDK18] studies the minimization of these OneMax generalizations. In our description below, we stick to this optimization target, to ease the comparison with the original publication.

The self-adjusting algorithm studied in [DDK16] is an RLS variant, which mutates one coordinate in every iteration. For each coordinate i, a velocity v_i is stored, which denotes the mutation strength at this coordinate. When in iteration t coordinate i is chosen for modification, the entry x_i of the current-best solution x is replaced by x_i + v_i with probability 1/2 and by x_i − v_i otherwise. The entries in all other positions are not subject to mutation. The resulting string y replaces x if its fitness is at least as good as the one of x, i.e., if f(y) ≤ f(x) holds (we recall that we aim at minimizing f). If the offspring is strictly better than its parent x, i.e., if f(y) < f(x) holds, the velocity v_i in the i-th component is increased by multiplying it with a fixed constant A > 1, and it is decreased by a fixed multiplicative factor b < 1 otherwise. If the value of v_i drops below 1 or exceeds its maximal allowed value, it is capped at the respective boundary.
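The sketch below implements this velocity-based RLS for the r-valued OneMax-type function f(x) = Σ_i |x_i − z_i| (one possible distance measure); the constants A = 2 and b = 1/2, the velocity cap at r, and the clamping of out-of-range entries are illustrative assumptions, not the exact choices of [DDK16].

import random

def velocity_rls(z, r, A=2.0, b=0.5, budget=100_000, seed=0):
    """Minimize f(x) = sum_i |x_i - z_i| over {0,...,r-1}^n with per-coordinate
    velocities: +/- v_i steps, velocity grows by A on strict improvement and
    shrinks by b otherwise (capped at 1 from below and at r from above)."""
    rng = random.Random(seed)
    n = len(z)
    f = lambda x: sum(abs(xi - zi) for xi, zi in zip(x, z))
    x = [rng.randrange(r) for _ in range(n)]
    v = [1.0] * n
    fx = f(x)
    for _ in range(budget):
        if fx == 0:
            break
        i = rng.randrange(n)
        step = round(v[i]) if rng.random() < 0.5 else -round(v[i])
        y = x[:]
        y[i] = min(max(y[i] + step, 0), r - 1)     # clamp to the domain (simplification)
        fy = f(y)
        if fy < fx:                                 # strict improvement: accept, grow v_i
            x, fx, v[i] = y, fy, min(v[i] * A, float(r))
        else:
            if fy == fx:                            # equally good offspring are accepted
                x = y
            v[i] = max(v[i] * b, 1.0)
    return x, fx

if __name__ == "__main__":
    rng = random.Random(1)
    r, n = 100, 50
    target = [rng.randrange(r) for _ in range(n)]
    print(velocity_rls(target, r)[1])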

Theorem 9 (Theorem 17 in [DDK18]).

For constants A > 1 > b > 0 satisfying mild additional conditions, the expected running time of this self-adjusting RLS variant on any of the generalized r-valued OneMax functions is Θ(n (log n + log r)). This is asymptotically best possible among all comparison-based variants of RLS and the (1+1) EA.

In this theorem, the update strengths can be chosen, for example, as A = F and b = F^{−1/4} for a constant F > 1, imitating the above-mentioned interpretation of the 1/5-th success rule proposed in [KMH04].

Using a result proven in [DRWW10], it is argued in [DDK18, Section 6.1] that the expected running time of the self-adjusting RLS variant is smaller, by a multiplicative factor, than that of any RLS or (1+1) EA variant using static step sizes. The optimality of the bound in Theorem 9 follows from the simple information-theoretic lower bound Ω(n log r), which applies to all comparison-based algorithms, and from the lower bound Ω(n log n), which applies to any unary unbiased black-box algorithm.

5.2.5 Success-Based Migration Intervals for Parallel EAs in the Island Model

A multiplicative success-based adaptation scheme has also been used to adjust the migration interval in a parallel (1+1) EA with a fixed number of islands. Mambrini and Sudholt [MS15] apply the two schemes described in Section 5.2.2 for the control of the offspring population size of the (1+λ) EA now to the control of the migration interval. In their parallel EA, every island has its own migration interval, at the end of which it broadcasts its current-best solution to all of its neighbors. In the first variant of the parallel EA (Algorithm 2 in [MS15]), improved solutions are always broadcast instantly to all neighboring islands, and the migration interval of the corresponding island is set to one. It is set to one also if during the migration interval at least one superior solution has migrated to the island. The migration interval is doubled otherwise, i.e., if at the end of the migration period no strictly better solution has been identified nor has migrated from a different island.

In the second scheme (Algorithm 3 in [MS15]), the broadcast happens only at the end of the migration interval, which is again doubled in case no improved solution could be identified nor migrated from another island, and halved otherwise.

One of the two schemes is analyzed for the complete graph topology, for which all migration intervals are identical. For the other variant, [MS15] proves results for general graph topologies as well as for a few selected topologies like the unidirectional ring, the grid, a torus, etc. The results comprise upper bounds on the expected communication effort needed to optimize general black-box optimization benchmarks, cf. Sections 4 and 5 in [MS15]. These bounds are then applied to the same benchmark functions as those regarded in Theorem 7. In some cases, including the complete graph topology, the adaptive migration intervals are shown to outperform any static choice in terms of expected communication effort, without (significantly) increasing the expected parallel running time. Table 1 in [MS15] summarizes the results for the selected benchmark problems. The bounds proven in [MS15] are upper bounds, and the question of complementing them with meaningful lower bounds seems to remain an open problem.

6 Learning-Inspired Parameter Control

In contrast to the success-based control mechanisms discussed in the previous section, we call learning-inspired all those self-adjusting parameter control mechanisms which are based on information obtained over more than one iteration.

6.1 Adaptive Operator Selection

An important class of parameter control schemes takes inspiration from the machine learning literature, and in particular from the multi-armed bandit problem. These adaptive operator selection techniques maintain a portfolio of possible parameter values. (The term “operator” is used because the adaptive operator selection mechanisms have originally not only been designed to choose between different parameter values but also between different actions, such as different variation operators.) At each step they decide which of the possible parameter values to use next. To this end, they assign to each possible parameter value a confidence value. This confidence value is supposed to be an indicator for how suitable the corresponding value is at the given stage of the optimization process. The confidence can be, for example, an estimator for the likelihood or the magnitude of progress we would obtain from running the algorithm with this value. These confidence values determine or modify the probabilities of choosing the corresponding parameter value. We present below three ways to implement this adaptive operator selection principle.

What distinguishes the parameter control setting from the classically regarded setting in machine learning is the fact that the “rewards”, i.e., the gain that we can obtain with a given value, can drastically change over time, compared to the static (but random) reward typically investigated in the machine learning literature. The non-static reward distributions change the complexity of the algorithms and the theoretical analysis considerably. As far as we know, the only theoretical work rigorously proving an advantage of learning-based parameter control is [DDY16a], which we shall discuss in more detail in Section 6.2. Despite the promising empirical performance of adaptive operator selection techniques, none of the techniques mentioned below could establish itself as a standard routine. Potential reasons for this situation include the complexity of these techniques, the difficulty of finding good hyper-parameters that govern the update rules, and a lack of theoretical support.

  • Probability Matching. This technique aims at assigning the selection probabilities proportionally to the confidence values, while maintaining for each parameter value a minimal probability $p_{\min}$ of being sampled. Concretely, in round $t$ we choose the $i$-th parameter value with probability

    $$p_{i,t} = p_{\min} + (1 - K p_{\min}) \, \frac{q_{i,t}}{\sum_{j=1}^{K} q_{j,t}},$$

    where $K$ is the total number of different parameter values from which we can choose (the size of the portfolio) and $q_{i,t}$ is the confidence in parameter value $i$ at time $t$.

    After executing one iteration with the $i$-th parameter value, its confidence value is updated to

    $$q_{i,t+1} = q_{i,t} + \beta \, \bigl( r_{i,t} - q_{i,t} \bigr),$$

    where $r_{i,t}$ denotes the (normalized) reward obtained in the $t$-th round and $\beta \in (0,1]$ is the hyper-parameter that determines the speed of the adaptation. The confidence values of parameter values that have not been selected in the $t$-th round are not updated.

  • Adaptive Pursuit. When larger portfolios are used, the previous mechanism, which chooses the operator with probability roughly proportional to the confidence value, might not give enough preference to the truly best choice. To address this, a more "aggressive" update rule has been suggested: Adaptive Pursuit. This selection scheme uses the same confidence values as Probability Matching, but applies a much more progressive update rule for the probabilities. In Adaptive Pursuit the selection probabilities are obtained from the probabilities of the previous iteration according to a "winner takes all" policy: the "best" arm, i.e., the parameter value with the highest confidence value, has its selection probability pushed towards a maximal value $p_{\max} := 1 - (K-1) p_{\min}$, while the selection probabilities of all other parameter values are pushed towards the minimal value $p_{\min}$. Empirical comparisons of Probability Matching and Adaptive Pursuit are presented in [Thi05]. In general, it seems that Adaptive Pursuit is more suitable for situations in which the quality differences between the potential parameter values are small but persistent.

  • Upper Confidence Bound. The upper confidence bound (UCB) algorithm, originally proposed in [ACBF02], plays an important role in machine learning, as it is one of the few strategies that can be proven to behave optimally in a classical operator selection problem. More precisely, the UCB algorithm can be proven to achieve minimal cumulative regret in the multi-armed bandit problem in which the reward of each "arm" follows a static probability distribution. Interpreting the different "arms" as the different parameter values that we want the algorithm to choose from, the UCB algorithm chooses in every step the parameter value that maximizes the expression

    $$\hat{q}_{i,t} + C \sqrt{\frac{2 \ln \bigl( \sum_{j=1}^{K} n_{j,t} \bigr)}{n_{i,t}}},$$

    where $\hat{q}_{i,t}$ is an estimate for the expected reward of the $i$-th parameter value, $n_{i,t}$ is the number of times the $i$-th parameter value has been chosen in the first $t$ iterations, and $C$ is a hyper-parameter that determines the balance between exploiting parameter values with high expected reward and exploring parameter values that have not yet been tested very often. While being provably optimal in static settings, the UCB algorithm is rather sedate, and thus not very well suited for environments that gradually change over time, which is the typical situation encountered when optimizing rather smooth problems. In the parameter control context it therefore makes sense to replace $n_{i,t}$ by a count that considers only a given recent time interval instead of the whole history (sliding window, cf. [FCSS10] for a detailed discussion and experimental results on two discrete benchmark problems). When the environment instead changes abruptly, a combination of the UCB algorithm with a statistical test that detects significant changes in the fitness landscape has been shown to perform very well [CFSS08, FCSS09].
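The following Python sketch collects the three selection rules in one small class, purely for illustration. The confidence update, the default values of the hyper-parameters p_min, beta, and C, and the use of the same confidence values for all three rules are simplifications; published variants differ in details (for instance, UCB is usually stated with the empirical mean reward as estimator).

```python
import math
import random

class OperatorPortfolio:
    """Illustrative sketch of the three adaptive operator selection rules
    described above for a portfolio of K parameter values.  Requires
    K * p_min < 1 so that the probabilities remain a distribution."""

    def __init__(self, K, p_min=0.05, beta=0.1, C=0.5):
        self.K, self.p_min, self.beta, self.C = K, p_min, beta, C
        self.q = [1.0] * K               # confidence values
        self.p = [1.0 / K] * K           # selection probabilities
        self.n = [0] * K                 # usage counters (for UCB)

    def probability_matching(self):
        total = sum(self.q)
        self.p = [self.p_min + (1 - self.K * self.p_min) * qi / total
                  for qi in self.q]
        return self._sample()

    def adaptive_pursuit(self):
        best = max(range(self.K), key=lambda i: self.q[i])
        p_max = 1 - (self.K - 1) * self.p_min
        for i in range(self.K):          # push probabilities towards p_max / p_min
            target = p_max if i == best else self.p_min
            self.p[i] += self.beta * (target - self.p[i])
        return self._sample()

    def ucb(self):
        t = sum(self.n) + 1
        def index(i):
            if self.n[i] == 0:           # force initial exploration of each value
                return float("inf")
            return self.q[i] + self.C * math.sqrt(2 * math.log(t) / self.n[i])
        return max(range(self.K), key=index)

    def update(self, i, reward):
        self.n[i] += 1
        self.q[i] += self.beta * (reward - self.q[i])

    def _sample(self):
        return random.choices(range(self.K), weights=self.p)[0]

# usage sketch: choose a value, run one iteration, feed back the progress
portfolio = OperatorPortfolio(K=4)
i = portfolio.probability_matching()     # or .adaptive_pursuit() / .ucb()
portfolio.update(i, reward=0.3)          # normalized progress of that iteration
```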

6.2 Theoretical Results for Learning-Inspired Parameter Control

The first, and so far only, theoretical work that rigorously analyzes a learning-inspired parameter selection scheme is [DDY16a]. The algorithm proposed there is a generalized version of randomized local search (RLS), which selects in every step the number of bits to be flipped according to the following rule. With probability $\varepsilon$ a random one of the possible mutation strengths is chosen, and with the remaining probability $1-\varepsilon$ the algorithm greedily selects the parameter value for which the expected progress (coined velocity in [DDY16a]) is maximized. The expected progress is estimated by a time-discounted average of the progresses observed in previous iterations. More precisely, the velocity of mutation strength $r$ at time $t$ is defined via

(3)

where $r_s$ is the parameter value used in the $s$-th iteration, and the hyper-parameter $\delta$ determines the speed of the adaptation process. [DDY16a] refer to $\delta$ as the forgetting rate, inspired by the observation that the reciprocal $1/\delta$ of the forgetting rate is (apart from constant factors) the information half-life. Note here that, compared to [DDY16a], we have changed the meaning of $\varepsilon$ and $\delta$, to be in line with the classical literature in machine learning, where the algorithm from [DDY16a] would be classified as an $\varepsilon$-greedy selection scheme (meaning that with probability $\varepsilon$ a random choice is taken and otherwise a greedy choice).
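A minimal sketch of such an ε-greedy choice of the mutation strength is given below. The velocity update used here is a plain exponentially discounted average of the observed progresses; it is a simplification of Equation (3), and the acceptance rule, the initialization, and all default parameter values are likewise only illustrative.

```python
import random

def epsilon_greedy_rls(n=100, radius=4, eps=0.1, delta=0.05, budget=20000):
    """Illustrative epsilon-greedy RLS on OneMax in the spirit of [DDY16a]:
    with probability eps a uniform random mutation strength from
    {1,...,radius} is used, otherwise the strength with the currently best
    velocity estimate."""
    onemax = lambda x: sum(x)
    x = [random.randint(0, 1) for _ in range(n)]
    velocity = [1.0] * radius                 # optimistic initial estimates
    t = 0
    while onemax(x) < n and t < budget:
        t += 1
        if random.random() < eps:
            r = random.randrange(radius) + 1                      # exploration
        else:
            r = max(range(radius), key=lambda i: velocity[i]) + 1  # greedy choice
        y = x[:]
        for i in random.sample(range(n), r):  # flip r distinct bits
            y[i] = 1 - y[i]
        progress = max(onemax(y) - onemax(x), 0)
        velocity[r - 1] = (1 - delta) * velocity[r - 1] + delta * progress
        if onemax(y) >= onemax(x):            # accept if not worse
            x = y
    return t, onemax(x)

if __name__ == "__main__":
    print(epsilon_greedy_rls())
```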

The main theoretical result in [DDY16a] is a proof that, for suitably selected hyper-parameters $\varepsilon$ and $\delta$, this algorithm essentially always uses the best possible mutation strength when run on OneMax. More precisely, it is shown that in all but a lower-order fraction of the iterations the selected parameter value achieves an expected progress that differs from the best possible one by at most some lower-order term. Consequently, the algorithm has the same optimization time (apart from an additive lower-order term) and the same asymptotic 13% superiority in the fixed-budget perspective as the fastest algorithm that can be obtained from these mutation strengths, which in turn comes arbitrarily close (by taking the maximal mutation strength large) to the performance of the hand-crafted mutation strength schedule presented in Theorem 4.

Theorem 10 (Theorems 1 and 2 in [DDY16a]).

Let $T_{\mathrm{opt}}$ be the minimal expected running time that any randomized local search algorithm using a fitness-dependent mutation strength of at most $r$ can achieve on OneMax. Then the expected running time of the $\varepsilon$-greedy RLS variant from [DDY16a] with suitably chosen hyper-parameters $r$, $\delta$, and $\varepsilon$ exceeds $T_{\mathrm{opt}}$ by at most an additive lower-order term.

In the fixed-budget perspective, the following holds. Let $x_t$ be the best solution that the $\varepsilon$-greedy RLS variant with this parameter setting has identified within the first $t$ iterations. Similarly, let $y_t$ be the best solution that the classic RLS, using 1-bit flips only, has found within the first $t$ iterations. For sufficiently large budgets $t$, the expected Hamming distance of $x_t$ to the optimum is smaller than that of $y_t$ by a constant factor (asymptotically about 13%); see [DDY16a] for the precise statement.

The hyper-parameters in this result were taken as one example for which the algorithm shows superior performance. As noted in [DDY16a], the particular choice of these parameters is not overly critical. Clearly, $\varepsilon$ has to be sub-constant to ensure that only a lower-order fraction of the iterations is performed with a sub-optimal mutation strength. Likewise, $\delta$ must not be too small, to ensure that information learned many iterations ago (and thus at a time when the velocities could have been substantially different) has no significant influence on the current decision.

In addition to this theoretical result, [DDY16a] also presents empirical results for the LeadingOnes and the minimum spanning tree (MST) problems. These experiments suggest that, for suitably chosen hyper-parameters, the average optimization time of the $\varepsilon$-greedy RLS variant can be significantly smaller than that of the (1+1) EA. It even outperforms, empirically, RLS on LeadingOnes, and, on the MST problem, the RLS variant that always flips either one or two random bits in the current-best solution.

7 Self-Adaptation: Endogenous Parameter Control

As we have seen in the previous sections, an elegant way to overcome the difficulty of finding the right parameters of an evolutionary algorithm, and to cope with the fact that the optimal parameter values may change during a run, is to let the algorithm optimize the parameters on the fly. Formally speaking, however, this is an even more complicated task, because we now have to design a suitable parameter setting mechanism. While a number of natural heuristics like the 1/5-th success rule have proven to be effective in certain cases, it would be even more elegant not to add an exogenous parameter control mechanism onto the algorithm, but rather to integrate the parameter control mechanism into the evolutionary process: one attaches the parameter value to the individual (consequently, there is no global parameter value, but each individual carries its own parameter value), modifies it via (extended) variation operators, and uses the fitness-based selection mechanism of the algorithm to ensure that good parameter values become dominant in the population.

This self-adaptation of the parameter values has two main advantages.

  • It is generic, that is, the adaptation mechanism is provided by the algorithm; only the representation of the parameter in the individual and the extension of the variation operators have to be provided by the user.

  • It allows re-using existing algorithms and existing code.

Despite these advantages, self-adaptation is not used a lot in discrete evolutionary optimization (unlike in continuous optimization), and consequently, there is also little theoretical work on this topic.

Self-adaptation for discrete evolutionary computation was proposed in the seminal paper [Bäc92] by Bäck, which also contains a mathematical convergence proof for the mutation rate (in the particular setting proposed there). Apart from this result, only two works on running time analysis for self-adapting parameter choices have appeared so far. Since these results, like the paper by Bäck, are concerned with self-adaptive mutation rates, we discuss self-adaptation only for mutation rates in the following, and we note that other parameters could be optimized via self-adaptation in a similar way.

7.1 Implementing Self-Adaptive Mutation Rates

To use self-adaptation for the mutation rate, the individuals (which usually are possible solution candidates) have to be extended to also contain "their" mutation rate. In the purest possible form, as done by Bäck [Bäc92], this is implemented by appending additional bits to the bit string which represents the solution candidate; these additional bits encode the mutation rate in a suitable manner. This pure form has the advantage that any standard variation operator can be used directly on the extended individuals. The downside of this approach is that non-binary data is artificially treated like binary decision variables.

It has been argued, e.g., in [DDK18], that it can be preferable to encode non-binary data in its original form and to modify it via data-specific variation operators. In the context of self-adaptation, the mutation rate has been encoded as a floating-point number in [BS96, KLR11], where it is mutated according to a log-normal distribution. In the recent theoretical works [DL16] and [DWY18], only a discrete set of possible mutation rates was allowed. In [DWY18], the mutation rates $r/n$ with $r$ being a power of two were used. As mutation, the rate $r/n$ was replaced by a random choice between $r/(2n)$ and $2r/n$.

With either representation of the mutation rate, the extended mutation operator (acting on the extended individuals) will always be such that first the encoded mutation rate is mutated and then the core individual is mutated with this new rate. This is necessary for the subsequent selection step to see an influence of the new mutation rate and thus, hopefully, prefer individuals with a more profitable mutation rate.
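The following few lines sketch this order of operations (first the rate, then the bits) for the discrete rate encoding of [DWY18]; the clamping of r and the concrete ranges are illustrative choices, not those of the original work.

```python
import random

def extended_mutation(individual, n):
    """Sketch of an extended mutation operator for a self-adaptive mutation
    rate r/n with r a power of two: the encoded rate is mutated FIRST
    (halved or doubled with equal probability), and the bit string is then
    mutated with the NEW rate, so that selection can react to the quality
    of the new rate.  The clamping of r is an illustrative choice."""
    bits, r = individual
    r = r // 2 if random.random() < 0.5 else 2 * r   # halve or double the rate parameter
    r = min(max(r, 1), n // 2)                       # keep the rate in a sensible range
    p = r / n
    new_bits = [1 - b if random.random() < p else b for b in bits]
    return new_bits, r

# Example: mutate a random 20-bit individual that carries rate 4/20.
if __name__ == "__main__":
    x = ([random.randint(0, 1) for _ in range(20)], 4)
    print(extended_mutation(x, 20))
```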

Finally, when designing a self-adaptive parameter optimization scheme, one may want to prefer non-elitist algorithms. An elitist algorithm carries the risk of getting stuck with individuals that have a high fitness but a very unprofitable mutation rate. In this situation, progress can only be made when the mutation of the mutation rate in a single iteration changes the rate to a value that admits an improvement; it is not possible to adjust the rate over several iterations while no improvement is made.

7.2 Theory for Self-Adaptive Mutation Rates

In the first work analyzing self-adaptation through the running time analysis paradigm, Dang and Lehre [DL16] regard the following setting. They use a simple non-elitist algorithm which in each iteration generates from a population of $\lambda$ individuals a new population of again $\lambda$ individuals. This is done by $\lambda$ times independently selecting an (extended) parent individual from the current population, mutating it via the (extended) mutation operator, and adding it to the new population. For the mutation rate, Dang and Lehre assume that there is only a finite set of pre-specified rates (for most results they consider only two rates). The extended mutation operator first, with a probability that is a global parameter of the algorithm, replaces the current rate of the individual by a random different one; then it mutates the core individual via standard bit mutation with the new rate. For the selection operator, a wide range of choices is subsumed in this work, since the results are phrased in terms of a parameter of the selection operator, namely the reproductive rate. A selection operator (possibly depending on a fitness function) has reproductive rate $\alpha$ if for all populations and each individual $x$ of the population, the expected number of times $x$ is chosen in $\lambda$ independent applications of the selection operator is at most $\alpha$. For example, always selecting a best individual from the population leads to $\alpha = \lambda$, whereas uniform random selection gives $\alpha = 1$.
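A compact sketch of such a non-elitist self-adaptive scheme is given below; the tournament selection, the concrete rate portfolio, and all hyper-parameter values are illustrative choices and not the precise setting analyzed in [DL16].

```python
import random

def nonelitist_self_adaptive_ea(f, n, lam=20, rates=(0.5, 2.0), p_change=0.1,
                                generations=500, tournament=2):
    """Illustrative non-elitist EA with self-adaptive mutation rates chi/n:
    each generation, lam offspring are created by tournament selection (one
    possible selection operator with bounded reproductive rate), followed by
    an extended mutation that first, with probability p_change, swaps the
    carried rate parameter chi and then applies standard bit mutation with
    the new rate chi/n."""
    pop = [([random.randint(0, 1) for _ in range(n)], random.choice(rates))
           for _ in range(lam)]
    for _ in range(generations):
        new_pop = []
        for _ in range(lam):
            bits, chi = max(random.sample(pop, tournament),
                            key=lambda ind: f(ind[0]))
            if random.random() < p_change:             # mutate the rate first
                chi = random.choice([c for c in rates if c != chi])
            p = chi / n
            child = [1 - b if random.random() < p else b for b in bits]
            new_pop.append((child, chi))
        pop = new_pop                                   # no elitism
    return max(pop, key=lambda ind: f(ind[0]))

if __name__ == "__main__":
    bits, chi = nonelitist_self_adaptive_ea(f=sum, n=30)   # OneMax fitness
    print(sum(bits), "ones, final rate parameter chi =", chi)
```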

For this setting, the following results are shown. If a mutation rate exceeds the so-called error threshold, that is, roughly speaking, if it is larger than $\ln(\alpha)/n$ by at least an $\varepsilon/n$ term for a constant $\varepsilon > 0$, where $\alpha$ is the reproductive rate of the selection operator, then the algorithm always using this rate (equivalent to a portfolio consisting of this single rate) and using random initialization needs with high probability an at least exponential time to reach the optimum of any pseudo-Boolean function with a unique optimum (this is Theorem 2 of [DL16] in the special case of a single rate).

If two rates are used and the mutation operator chooses the rate of the offspring uniformly at random from these two, then even if only one of the rates satisfies the dangerous condition above, the problem can remain: for suitable combinations of the two rates and the reproductive rate, again an at least exponential running time results with high probability (Theorem 4 in [DL16]). This result again applies to any pseudo-Boolean function having a unique optimum.

The latter of these two results shows that randomly mixing a good and a bad operator can be essentially as bad as using the bad operator alone. This is not overly surprising, but it highlights the contrast with the following result for a self-adaptive choice of the mutation rate. For a suitably constructed example function $f$ it is proven that the algorithm with a suitably initialized population, with tournament selection of appropriate tournament size, with a sufficiently large population size, and with a self-adaptive choice between two suitably chosen mutation rates finds the optimum of $f$ in polynomial time, whereas either of these two rates alone, as well as randomly mixing between them, leads to an at least exponential running time with high probability.

As for almost all such examples, this one, too, is slightly artificial and needs quite some assumptions, for example that all individuals are initialized with the unique local optimum. Nevertheless, this result demonstrates that self-adaptation can outperform static parameter choices and random mixing. The reason, as the proofs reveal, is that self-adaptation is able to find within relatively short time the mutation rate which is currently most profitable (as opposed to fixed parameter choices) and to remember it (as opposed to random mixing).

Very recently, a less artificial example for the use of self-adaptation was presented in [DWY18]. There it was shown that the (1,λ) EA with a self-adaptive choice of the mutation rate can achieve an asymptotically identical performance as the self-adjusting (1+λ) EA presented in [DGWY17] (see also Section 5.2.3). In the self-adaptive setting of [DWY18], the extended individuals store their mutation rate $r/n$, where $r$ is an integer (a power of two). The extended mutation operator first changes $r$ to $r/2$ or $2r$ (uniform random choice) and then performs standard bit mutation with the new mutation rate $r/n$. One of the $\lambda$ offspring with maximum fitness is selected as the new parent individual; in case of ties, individuals with smaller rate are preferred, which creates a small extra drift towards the usually recommended rates of order $1/n$. It is shown that, when $\lambda$ is sufficiently large (at least of logarithmic order in $n$), this algorithm finds the optimum of the OneMax function in an expected number of $O(n/\log\lambda + (n\log n)/\lambda)$ iterations, which is the asymptotically best possible running time for $\lambda$-parallel algorithms (cf. Theorem 3 cited from [BLS14]).

8 Hyper-Heuristics

Hyper-heuristics are search or optimization heuristics which during the run of the algorithm choose in a possibly adaptive manner which low-level heuristics to use. Since in some situations hyper-heuristics can closely resemble an adaptive parameter choice, we describe in this section what is known about such hyper-heuristics.

8.1 Brief Introduction to Hyper-Heuristics

Hyper-heuristics either choose from a pre-specified set of low-level heuristics (selection hyper-heuristics) or try to generate low-level heuristics from existing components (generation hyper-heuristics). There is a considerable amount of applied research on generation hyper-heuristics, e.g., for scheduling problems, packing problems, satisfiability, and the traveling salesman problem. However, since there appears to be no theoretical work on generation hyper-heuristics and since, naturally, generation hyper-heuristics are substantially different from parameter control mechanisms, we do not further detail this sub-area and refer, as for all other topics incompletely covered here, to the recent survey [BGH13].

As is true for optimization heuristics in general, hyper-heuristics can also be divided into construction hyper-heuristics and perturbation hyper-heuristics. The former try to construct a solution from partial solutions. This has led to interesting results, e.g., in production scheduling, educational timetabling, and vehicle routing. Since constructing a solution from partial solutions is necessarily a rather problem-specific approach, it is not surprising that general theoretical results for this sub-area do not yet exist.

In contrast, perturbation hyper-heuristics work, in a similar manner as classic evolutionary algorithms, with complete solution candidates, which are randomly modified in the hope of obtaining superior solutions. Perturbation selection hyper-heuristics have found numerous applications, among others in various scheduling contexts. The most common form of perturbative selection hyper-heuristics are single-point searches, which, in a fashion analogous to (1+1) EAs and (1+λ) EAs, repeatedly create one or more offspring from a single parent and select the next parent from these offspring and the previous parent. For such selection hyper-heuristics, some general mechanisms for choosing the low-level heuristic that creates the offspring have been proposed, see Section 8.3.

As said above, selection hyper-heuristics are methods that select, during the run of the algorithm, which one out of several pre-specified simpler algorithmic building blocks to use. When the different pre-specified choices are essentially identical apart from an internal parameter, then this selection hyper-heuristic could equally well be interpreted as a dynamic choice of the internal parameter. For example, when only the two mutation operators are available that flip exactly one or exactly two bits, then a selection hyper-heuristic choosing between them could also be interpreted as the randomized local search heuristic using a dynamic choice of the number of bits it flips. Conversely, some of the works described previously could equally well be described in the language of simple selection hyper-heuristics. In this text, we follow the language used by the original authors and do not aim at drawing a line between the different fields.

We now describe the main theoretical works that have appeared in the hyper-heuristics community, insofar as they resemble dynamic parameter control mechanisms, the main topic of this chapter.

8.2 Random Mixing of Low-Level Heuristics

8.2.1 Markov Chain Analyses

The first theoretical study on selection hyper-heuristics was conducted by He, He, and Dong [HHD12]. They regard the variant of the classic (1+1) EA which in each iteration selects a mutation operator from a finite set of operators according to a fixed probability distribution. In the hyper-heuristics language, this is a single-point selection heuristic using a mixed strategy. He et al. show that the asymptotic convergence rate and the asymptotic hitting time resulting from any mixed strategy are not worse than those resulting from exclusively using the worst of the given operators.

Some care is necessary when interpreting this result. The asymptotic hitting time as defined in [HHD12] is not the asymptotic order of magnitude of the classic hitting time (the number of iterations until the optimum is generated), but the spectral radius of the fundamental matrix $N = (I - T)^{-1}$ of the Markov chain describing the parent individual in a run of this single-point heuristic, where $I$ is the identity matrix and $T$ is the transition matrix restricted to the non-optimal search points. This asymptotic hitting time is only loosely related to the classic hitting time. Denoting by $m(x)$ the classic expected hitting time of this Markov chain (usually called the optimization time of the EA) when started in the state $x$, only the weak relation

$$\min_{x \in S} m(x) \;\le\; \rho\bigl((I - T)^{-1}\bigr) \;\le\; \max_{x \in S} m(x)$$

is known, where $S$ is the set of all non-optimal search points and $\rho(\cdot)$ denotes the spectral radius. Consequently, the asymptotic hitting time only provides a lower bound for the worst-case expected hitting time $\max_{x \in S} m(x)$. Note that the best-case expected hitting time is often very small, as witnessed by search points that are very close to the optimum. Consequently, the lower bound for the worst-case hitting time given by the asymptotic hitting time can be relatively weak. Nothing is known about how the asymptotic hitting time relates to the running time when starting from a random search point, which is the usual performance measure. For these reasons, it is not clear how to translate the result of [HHD12] into the classic running time analysis language.
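These quantities are easy to compare numerically on a toy chain. The sketch below uses an arbitrary 3-state sub-stochastic transition matrix (the numbers carry no meaning) to compute the fundamental matrix, the expected hitting times, and the spectral radius.

```python
import numpy as np

# Toy illustration: for an absorbing Markov chain whose transition matrix
# restricted to the non-optimal states is T, the fundamental matrix is
# N = (I - T)^{-1}, the expected hitting times of the optimum are the row
# sums N @ 1, and the asymptotic hitting time of [HHD12] is the spectral
# radius of N.
T = np.array([[0.5, 0.3, 0.0],
              [0.1, 0.6, 0.2],
              [0.0, 0.2, 0.4]])          # row sums < 1: the missing mass moves to the optimum
N = np.linalg.inv(np.eye(3) - T)         # fundamental matrix
hitting_times = N @ np.ones(3)           # expected hitting times from each non-optimal state
spectral_radius = max(abs(np.linalg.eigvals(N)))
print(hitting_times, spectral_radius)
# Since N is non-negative, its spectral radius lies between its smallest and
# largest row sum, i.e., between the best-case and the worst-case expected
# hitting time; it therefore only gives a (possibly weak) lower bound on the
# worst-case expected hitting time.
```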

8.2.2 Running Time Analysis of Mixed Strategies

The first to conduct a running time analysis for selection hyper-heuristics in the classic methodology were Lehre and Özcan [LÖ13]. In [LÖ13, Theorem 3] it is stated that the (1+1) EA (we note that some authors prefer to call the algorithm used in [LÖ13] a variant of randomized local search rather than an evolutionary algorithm, since it only creates offspring within a bounded distance from the parent) using the mixed strategy of choosing in each iteration the mutation operator randomly between the 1-bit flip operator (with probability $p$) and the 2-bit flip operator (with probability $1-p$) optimizes the OneMax function in an expected time of at most

(4)

It appears to us that this result is not absolutely correct since, e.g., in the case $p = 0$ the expected optimization time is clearly infinite: if the random initial search point has an odd Hamming distance from the optimum, then the optimum cannot be reached via 2-bit flips alone. For similar reasons, the expected running time has to be larger than (4) for very small positive values of $p$. We therefore prove the following result.
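Before doing so, we note that the parity obstruction for $p = 0$ is also easy to observe empirically; the following toy experiment is a purely illustrative sketch (with an acceptance rule that also accepts equally good offspring), not part of the proof.

```python
import random

def mixed_strategy_ea(n, p, max_iters=20000):
    """(1+1)-type algorithm on OneMax: flip one random bit with probability p,
    otherwise two distinct random bits; accept offspring that are not worse.
    Returns the number of iterations used, or None if the budget is exhausted."""
    x = [random.randint(0, 1) for _ in range(n)]
    for t in range(1, max_iters + 1):
        y = x[:]
        k = 1 if random.random() < p else 2
        for i in random.sample(range(n), k):
            y[i] = 1 - y[i]
        if sum(y) >= sum(x):
            x = y
        if sum(x) == n:
            return t
    return None

if __name__ == "__main__":
    # With p = 0, the runs starting at odd Hamming distance (about half of
    # all runs) can never reach the optimum, regardless of the budget.
    runs = [mixed_strategy_ea(20, p=0.0) for _ in range(20)]
    print(sum(r is None for r in runs), "of 20 runs failed with p = 0")
```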

Theorem 11.

Consider the (1+1) EA with the mixed mutation strategy of flipping a single random bit with probability $p$ and flipping two (different) random bits with probability $1-p$. Let $T$ be the running time (number of iterations) of this algorithm on the OneMax benchmark function. If $p > 0$, then $E[T]$ is finite and, by the fitness level argument in the proof below, at most $\sum_{d=1}^{n} s_d^{-1}$, where $s_d = p\,\frac{d}{n} + (1-p)\,\frac{d(d-1)}{n(n-1)}$ is the probability of an improvement from Hamming distance $d$.

If $p = 0$, then with probability $1/2$ the algorithm never finds the optimum (and thus the expected running time is infinite).

Proof.

For the case $p = 0$, we note that with probability exactly $1/2$ the random initial search point has an odd Hamming distance from the optimum (this well-known fact follows from the beautiful argument $\sum_{i \text{ odd}} \binom{n}{i} = \tfrac{1}{2}\bigl((1+1)^n - (1-1)^n\bigr) = 2^{n-1}$). Since 2-bit flips change the Hamming distance by $-2$, $0$, or $+2$, the algorithm can never reach the optimum in this case.

Hence let us assume $p > 0$ for the remainder of this proof. When the current search point of the (1+1) EA has Hamming distance $d$ from the optimum, then the probability $s_d$ that one iteration ends with a strictly better search point is

$$s_d \;=\; p \cdot \frac{d}{n} \;+\; (1-p) \cdot \frac{d(d-1)}{n(n-1)},$$

since a 1-bit flip is improving if and only if it flips one of the $d$ incorrect bits, and a 2-bit flip is improving if and only if both flipped bits are incorrect.
The classic fitness level theorem now yields $E[T] \le \sum_{d=1}^{n} s_d^{-1}$, which is the bound stated above. Evaluating this sum in closed form uses elementary estimates of $s_d$ together with an identity that can be shown easily by induction.

For , we also have the estimate