# Drift Theory in Continuous Search Spaces: Expected Hitting Time of the (1+1)-ES with 1/5 Success Rule

This paper explores the use of the standard approach for proving runtime bounds in discrete domains---often referred to as drift analysis---in the context of optimization on a continuous domain. Using this framework we analyze the (1+1) Evolution Strategy with one-fifth success rule on the sphere function. To deal with potential functions that are not lower-bounded, we formulate novel drift theorems. We then use the theorems to prove bounds on the expected hitting time to reach a certain target fitness in finite dimension d. The bounds correspond to linear convergence. We then study the dependency of the different terms on d, proving a convergence rate dependency of Θ(1/d). Our results constitute the first non-asymptotic analysis for the algorithm considered, as well as the first explicit application of drift analysis to a randomized search heuristic with continuous domain.


## Erratum

The version of this paper published at ACM-GECCO 2018 contains some technical errors. The errors do not affect the correctness of the theorems. They are corrected in this version. For clarity, the changes compared to the GECCO version are marked in blue.

## 1 Introduction

The standard methodology for proving runtime bounds of evolutionary algorithms defined on a discrete search space is often referred to as drift analysis. It consists in proving a drift condition, e.g., an expected one-step change of at most $-B$ for a constant $B > 0$ (additive drift) w.r.t. a potential, which directly translates into a bound on the hitting time to reach the optimum. It allows one to decouple generic mathematical arguments, summarized in drift theorems, from arguments specific to the algorithm. With drift analysis, proofs that used to span several pages have been simplified considerably [11, 5, 4, 12, 13].

In this work we explore the utility of such an approach for the analysis of algorithms operating on a continuous domain. For this purpose we focus on the analysis of the (1+1)-ES with one-fifth success rule on the sphere function $f(x) = \|x\|^2$, $x \in \mathbb{R}^d$. We are particularly interested in the benefits of drift analysis over current tools for analyzing continuous randomized search heuristics, like investigating the stability of Markov chains.

#### The (1+1)-ES

We focus here on one of the simplest adaptive algorithms, namely the (1+1) evolution strategy (ES) with one-fifth success rule [14]. It is defined in Algorithm 1, where we assume minimization of a function $f : \mathbb{R}^d \to \mathbb{R}$. The state of the algorithm at iteration $t$ is $\theta_t = (m_t, \sigma_t)$, where $m_t \in \mathbb{R}^d$ is the mean of the Gaussian sampling distribution and also the best solution found so far, and $\sigma_t > 0$ is the standard deviation of the distribution or “step-size” that controls the distance at which novel solutions are sampled. This variant of the algorithm, which was first proposed in [10], implements Rechenberg’s idea of maintaining a probability of success of roughly $1/5$. This algorithm is not a “toy” algorithm, as it features the important flavor of the widely used state-of-the-art CMA-ES [6], namely adaptation of the sampling distribution.
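A minimal Python sketch of such a (1+1)-ES with one-fifth success rule follows; the step-size update factors are illustrative assumptions (any pair where four failures cancel one success yields a target success rate of 1/5), not necessarily the constants of Algorithm 1.

```python
import math
import random

def one_plus_one_es(f, m, sigma, max_iters=20_000, target=1e-10, rng=None):
    """(1+1)-ES with a one-fifth success rule, minimizing f.

    On success the step size grows by 1.5**(1/d), on failure it shrinks by
    1.5**(-0.25/d); four failures cancel one success, so sigma is stationary
    at a success rate of 1/5.  These factors are illustrative choices.
    """
    rng = rng or random.Random(0)
    d = len(m)
    fm = f(m)
    for t in range(max_iters):
        # offspring x = m + sigma * z with z ~ N(0, I)
        x = [mi + sigma * rng.gauss(0.0, 1.0) for mi in m]
        fx = f(x)
        if fx <= fm:                       # success: accept, enlarge step
            m, fm = x, fx
            sigma *= 1.5 ** (1.0 / d)
        else:                              # failure: shrink step
            sigma *= 1.5 ** (-0.25 / d)
        if fm < target:
            return t + 1, m, sigma
    return max_iters, m, sigma

if __name__ == "__main__":
    sphere = lambda x: sum(xi * xi for xi in x)
    iters, m, _ = one_plus_one_es(sphere, [1.0] * 10, 1.0)
    print(iters, sphere(m))
```

On the sphere in dimension 10 this reaches fitness $10^{-10}$ within a few thousand iterations, exhibiting the linear convergence analyzed in this paper.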

#### Drift Analysis in $\mathbb{R}^d$

Interestingly, although drift theorems are often formulated for finite domains, they naturally generalize to continuous domains [12, 13]. To date, however, drift analysis in the style of discrete-domain runtime analysis has not been explicitly applied to analyze continuous algorithms. Note that drift conditions are also central in other approaches addressing convergence in continuous domains, while they are typically not used for obtaining bounds on the hitting time (see below). At the same time, we will see that difficulties can arise in continuous search spaces, since it is natural to use a potential function that converges to minus infinity when approaching the optimum. To overcome these problems we formulate novel drift theorems.

Analyzing state-of-the-art continuous evolutionary algorithms means analyzing adaptive algorithms. While this adaptation is the key to the practical success of ESs (ensuring linear convergence on wide classes of problems, similar to gradient-based methods on strongly convex functions), it in turn makes the analysis difficult. Indeed, when $\sigma_t$ is too small compared to $\|m_t\|$, the progress towards the optimum is very small. This complicates the task of finding a suitable potential function and proving a drift condition.

When analyzing algorithms in continuous domains, our goals are (i) to establish how fast the algorithm converges for a fixed dimension $d$ (usually linear convergence), and (ii) to investigate the dependency of the convergence rate on the search space dimension (usually $\Theta(1/d)$); this is different from discrete domains, where the optimum can be located in finite time. In terms of hitting time to reach a certain precision $\epsilon > 0$, property (i) means that for all $\epsilon$ the expected hitting time is finite and proportional to $\log(1/\epsilon)$, while property (ii) means that it is also proportional to $d$.

#### Related work

Some of the drift methodology underlies many results of J. Jägersküpper [7, 8, 9]. Drift is not made explicit in these works, which arguably makes the analyses difficult to follow. This may be the reason why, so far, nobody has built on Jägersküpper’s impressive work. We also point out that Algorithm 1 differs from the variant analyzed by Jägersküpper, where the step size is kept fixed for several iterations.

For a fixed dimension, linear convergence of the algorithm on scaling-invariant functions, including in particular the sphere function, has been shown using Markov chain analysis [2]. This analysis is asymptotic in nature and does not provide a dependency of the convergence rate on the dimension. The difficult part of that approach also consists in proving a drift condition, which however needs to hold only outside a compact set, not on the whole domain.

Drift of a step size adaptive algorithm is also analyzed in [3], the only prior work that uses a potential function in a continuous domain. That approach remains very restricted, applying only to symmetric functions of a single variable.

In this work, we go beyond the state of the art as follows. In contrast to Jägersküpper’s results, our bounds provide (non-asymptotic) constants, and they hold with full probability. In contrast to Markov chain analysis, we obtain a dependency on the dimension and non-asymptotic results. Compared to [3], we go beyond a proof of concept by analyzing a simple yet realistic algorithm.

#### Outline

The rest of the paper is organized as follows. In the next section we introduce novel drift theorems for lower and upper bounds that can deal with unbounded potentials, since unbounded potentials are a natural design in our context. In Section 3, we prove technical results needed to derive the drift condition for the upper bound. In Section 4 we define our potential function and show two drift conditions for the lower and the upper bound. By applying the drift theorems to the drift conditions we derive lower and upper bounds on the first hitting time, corresponding to linear convergence with a $\Theta(1/d)$ scaling of the convergence rate. For the sake of readability, all proofs are given in the appendix.

#### Notation

A multivariate normal distribution with mean $m$ and covariance matrix $\sigma^2 I$ is denoted $\mathcal{N}(m, \sigma^2 I)$. With $\Phi_1$ we denote the cumulative distribution function of the standard normal distribution on $\mathbb{R}$, and $\varphi_d$ is the pdf of the standard normal distribution on $\mathbb{R}^d$. The indicator function of a set or condition $E$ is denoted by $1\{E\}$.

## 2 Additive Drift on an Unbounded Domain

In the continuous setting considered in this paper, we aim at proving a runtime bound that translates into linear convergence. Linear convergence is typically pictured as the log of the distance to the optimum converging to minus infinity like $-c \cdot t$ for a constant $c > 0$. It is thus natural to construct a potential function that involves the log of the distance to the optimum. Yet, this means that the potential function can take values that are arbitrarily negative, while in drift theorems it is typical to assume that the potential function is lower bounded (by zero or one). For this reason we need to adapt existing drift theorems.

We adopt the following formalism. Let $(X_t)_{t \in \mathbb{N}}$ be a sequence of real-valued random variables adapted to a filtration $(\mathcal{F}_t)_{t \in \mathbb{N}}$. In our typical setting $X_t$ is homogeneous to the logarithm of the distance to the optimum and thus goes to minus infinity when linear convergence occurs. Additionally, from one iteration to the next, $X_{t+1}$ can be arbitrarily much smaller than $X_t$. This happens if by chance we have made an atypically good step that improves the current solution a lot.

While arbitrarily good steps should be helpful in the sense of making the hitting time only smaller, we face the technical difficulty of distinguishing this situation from the following scenario: consider a process with an average decrease of $B > 0$, i.e., fulfilling $\mathbb{E}[X_{t+1} \mid \mathcal{F}_t] \le X_t - B$, but where $X_{t+1}$ equals $X_t$ with probability $1 - \epsilon$, and with probability $\epsilon$ we jump to $X_t - B/\epsilon$, possibly overjumping the target in a single but very improbable step. The time needed to sample this jump is geometrically distributed with expectation $1/\epsilon$, resulting in an arbitrarily large hitting time as $\epsilon \to 0$. This small example illustrates that controlling only the drift is not enough for bounding the expected hitting time. If the domain is bounded from below then the size of a possible jump is also bounded, avoiding this difficulty. Here, therefore, we have to find a way of controlling extreme events.
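This two-point jump process is easy to simulate; the following sketch (with illustrative parameters $B = 1$ and target $\beta = -0.5$) shows the mean hitting time growing like $1/\epsilon$ even though the one-step drift is $-B$ for every $\epsilon$.

```python
import random

def jump_process_hitting_time(eps, B=1.0, x0=0.0, beta=-0.5, rng=None):
    """X stays put with prob. 1-eps and jumps by -B/eps with prob. eps,
    so the one-step drift is exactly -B while the hitting time of beta
    is geometric with mean 1/eps."""
    rng = rng or random.Random(1)
    x, t = x0, 0
    while x > beta:
        if rng.random() < eps:
            x -= B / eps
        t += 1
    return t

if __name__ == "__main__":
    for eps in (0.1, 0.01, 0.001):
        mean_t = sum(jump_process_hitting_time(eps, rng=random.Random(s))
                     for s in range(200)) / 200
        print(eps, mean_t)   # roughly 1/eps, although the drift is -1 throughout
```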

To circumvent this problem, instead of directly controlling the drift of $(X_t)$ we will control the drift of a process with truncated and hence bounded single-step progress. More precisely, for given $A > 0$ we consider the truncated process $(Y^A_t)_{t \in \mathbb{N}}$ defined iteratively as $Y^A_0 = X_0$ and

$$Y^A_{t+1} = Y^A_t + \max\{X_{t+1} - X_t,\ -A\}, \tag{1}$$

where progress (towards minus infinity) larger than $A$ is cut. By construction it holds almost surely¹

$$Y^A_{t+1} - Y^A_t \ge -A, \tag{2}$$

$$X_t \le Y^A_t, \tag{3}$$

where the latter inequality holds since $\max\{X_{t+1} - X_t,\ -A\} \ge X_{t+1} - X_t$.

¹We write almost surely although the property is deterministic, simply to disambiguate from in distribution and in expectation.

As a direct consequence of inequality (3), for $\beta < X_0$, the hitting time of $(X_t)$ to reach $\left]-\infty, \beta\right]$ is upper bounded by the hitting time of $(Y^A_t)$ to reach $\left]-\infty, \beta\right]$, i.e., $T^X_\beta \le T^{Y^A}_\beta$. Hence an upper bound on the hitting time of $(Y^A_t)$ results in an upper bound on the hitting time of $(X_t)$. Exploiting this idea, we derive an upper bound on the hitting time in the following theorem, based on bounding the drift of the truncated process $(Y^A_t)$.

###### Theorem 1 (Upper bound via drift on truncated process).

Let $(X_t)_{t \in \mathbb{N}}$ be a sequence of real-valued random variables adapted to a filtration $(\mathcal{F}_t)_{t \in \mathbb{N}}$ with $X_0 = x_0$. For $\beta < x_0$ let $T^X_\beta$ be the first hitting time of the set $\left]-\infty, \beta\right]$. If there exist $A > 0$ and $B > 0$ such that $X_t$ is integrable, i.e. $\mathbb{E}[|X_t|] < \infty$ for all $t$, and

$$\mathbb{E}\bigl[\max\{X_{t+1} - X_t,\ -A\} \,\big|\, \mathcal{F}_t\bigr] \le -B \quad \text{for all } t, \tag{4}$$

then the expectation of $T^X_\beta$ satisfies

$$\mathbb{E}\bigl[T^X_\beta\bigr] \le \mathbb{E}\bigl[T^{Y^A}_\beta\bigr] \le \frac{x_0 - \beta + A}{B}. \tag{5}$$

Under slight misuse of terminology, we refer to the expected truncated one-step change on the left-hand side of equation (4) as the truncated drift.
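As a sanity check of Theorem 1, one can compare the bound (5) with the simulated hitting time of a simple Gaussian random walk (an illustrative process, not the ES itself); the constants $A = 3$ and $B = 0.19$ below are hand-picked so that the truncated drift condition (4) holds for steps with mean $-0.2$.

```python
import random
import statistics

def hitting_time(rng, x0=0.0, beta=-10.0, mu=-0.2):
    """First t with X_t <= beta for a random walk with N(mu, 1) increments."""
    x, t = x0, 0
    while x > beta:
        x += rng.gauss(mu, 1.0)
        t += 1
    return t

if __name__ == "__main__":
    # For steps G ~ N(-0.2, 1) and A = 3, the truncated drift E[max{G, -A}]
    # is still below -B for B = 0.19 (the truncation costs less than 0.01),
    # so Theorem 1 yields E[T] <= (x0 - beta + A) / B.
    x0, beta, A, B = 0.0, -10.0, 3.0, 0.19
    bound = (x0 - beta + A) / B
    times = [hitting_time(random.Random(s)) for s in range(300)]
    print(statistics.mean(times), "<=", bound)
```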

###### Remark 1.

A drift on the truncated process $(Y^A_t)$ also implies a drift on $(X_t)$. Indeed, since

$$X_{t+1} - X_t \le \max\{X_{t+1} - X_t,\ -A\} = Y^A_{t+1} - Y^A_t,$$

if inequality (4) is satisfied then it holds

$$\mathbb{E}[X_{t+1} \mid \mathcal{F}_t] - X_t \le -B. \tag{6}$$

The next proposition ensures that the integrability of the truncated process is implied by the integrability of the original process $(X_t)_{t \in \mathbb{N}}$.

###### Proposition 1 (Integrability of the truncated process).

If a process $(X_t)_{t \in \mathbb{N}}$ is integrable, i.e., $\mathbb{E}[|X_t|] < \infty$ for all $t$, then the truncated process $(Y^A_t)_{t \in \mathbb{N}}$ defined in equation (1) is integrable as well.

Our lower bound also relies on an unbounded potential function. Typical drift theorems for establishing lower bounds assume that the potential is bounded and hence cannot be applied directly [9]. Instead we use the following theorem, the proof of which can be seen as a reformulation of the arguments used in [8, Theorem 2] as a drift theorem. It generalizes [9, Lemma 12]. Note that due to the more general setting we lose a (bearable) factor of four in the bound.

###### Theorem 2.

Let $(X_t)_{t \in \mathbb{N}}$ be integrable and adapted to $(\mathcal{F}_t)_{t \in \mathbb{N}}$ such that

$$X_0 = x_0, \qquad X_{t+1} \le X_t, \qquad \text{and} \qquad \mathbb{E}[X_{t+1} \mid \mathcal{F}_t] - X_t \ge -C$$

for a constant $C > 0$. For $\beta < x_0$ we define $T^X_\beta = \min\{t \in \mathbb{N} \mid X_t \le \beta\}$. Then the expected hitting time is lower bounded by

$$\mathbb{E}\bigl[T^X_\beta\bigr] \ge \frac{x_0 - \beta}{4C} - \frac12.$$

## 3 Probability of Success with Positive Progress Rate

In this section we derive properties of the success probability that will be central for establishing the drift condition for the upper bound. For an improvement rate and in we define the success probability with rate given as

 psuccr,d(m,σ)=Prx∼N(m,σ2I)(∥x∥<(1−r)⋅∥m∥)

i.e. as the probability that the norm of the offspring is smaller than . As a consequence of the isotropy of the multivariate normal distribution, this success probability equals

 psuccr,d(m,σ)=Pr(∥∥∥e1+σ∥m∥N∥∥∥<(1−r))

where and

is a standard normally distributed vector. This latter equation reveals that the probability of success with improvement rate

is a function of . Let us introduce the normalized step size and define

 psuccr,d(¯σ):=Pr(∥∥∥e1+¯σdN∥∥∥<(1−r)), (7)

then . For we recover the “classic” probability of success

 psucc0,d(¯σ):=Pr(∥∥∥e1+¯σdN∥∥∥<1).

The success probability function is illustrated in Figure 1.

We start by proving that the function $\bar\sigma \mapsto p^{r,d}_{\text{succ}}(\bar\sigma)$ is continuous, and that for $r = 0$ it is monotonically decreasing and hence bijective onto its image. This is formalized in the following lemma:

###### Lemma 1.
1. For all $r \ge 0$ and $d \in \mathbb{N}$, the function $\bar\sigma \mapsto p^{r,d}_{\text{succ}}(\bar\sigma)$ is positive and continuous.

2. For $r = 0$ it is strictly monotonically decreasing and thus bijective onto its image.

3. For all $d$, the image of $\bar\sigma \mapsto p^{0,d}_{\text{succ}}(\bar\sigma)$ is $\left]0, 1/2\right[$.

We now investigate the asymptotic limit of the success probability for $d$ tending to infinity.

###### Lemma 2.

For $r = \rho/d$ with $\rho \ge 0$, the limit $\lim_{d \to \infty} p^{\rho/d,\,d}_{\text{succ}}(\bar\sigma)$ exists, and it equals $\Phi_1\bigl[-\frac{\rho}{\bar\sigma} - \frac{\bar\sigma}{2}\bigr]$. For $\rho = 0$, this limit function $\bar\sigma \mapsto \Phi_1[-\bar\sigma/2]$ is continuous and strictly monotonically decreasing, and its image is $\left]0, 1/2\right[$.

For $r = 0$ we recover the known result that the asymptotic limit of the probability of success equals $\Phi_1[-\bar\sigma/2]$ [1]. The above lemma captures the intuition that mere success is most probable with a small step size (for $\rho = 0$ the limit is maximal as $\bar\sigma \to 0$), while a non-trivial step size ($\bar\sigma$ bounded away from zero) is needed for making significant progress ($\rho > 0$).
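The limit in Lemma 2 can be checked numerically. The sketch below estimates the success probability of eq. (7) with $r = \rho/d$ by plain Monte Carlo and compares it with $\Phi_1[-\rho/\bar\sigma - \bar\sigma/2]$; the sample size and the choices $\rho = 1$, $\bar\sigma = 1.5$ are arbitrary.

```python
import math
import random

def p_succ(rho, d, sigbar, n=10_000, rng=None):
    """Monte Carlo estimate of eq. (7) with improvement rate r = rho / d."""
    rng = rng or random.Random(3)
    r = rho / d
    hits = 0
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # squared norm of e_1 + (sigbar/d) * z
        norm_sq = (1.0 + (sigbar / d) * z[0]) ** 2
        norm_sq += sum(((sigbar / d) * zi) ** 2 for zi in z[1:])
        hits += norm_sq < (1.0 - r) ** 2
    return hits / n

def std_normal_cdf(x):
    # Phi_1 via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

if __name__ == "__main__":
    rho, sigbar = 1.0, 1.5
    limit = std_normal_cdf(-rho / sigbar - sigbar / 2.0)   # Lemma 2
    print(p_succ(rho, d=200, sigbar=sigbar), limit)
```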

## 4 Potential and Drift

In this section we define a potential function that gives rise to the unbounded and untruncated process $(X_t)$ from Section 2. First we establish that it satisfies the conditions of Theorem 1. Then we prove a drift condition for the lower bound. Finally we apply the drift theorems to obtain lower and upper bounds for the first hitting time of the (1+1)-ES. Our goal is to establish lower and upper bounds on the expected first hitting time of $\log(\|m_t\|)$ to the set $\left]-\infty, \log(\epsilon)\right]$, where $\log(\epsilon)$ is the logarithm of the target distance $\epsilon > 0$ to the optimum. Linear or geometric convergence of the (1+1)-ES, which is what we observe in simulation and what Jägersküpper found in his analysis with overwhelming probability, is implied if $\log(\|m_t\|)$ decreases at a linear rate towards $-\infty$. The potential function will be chosen so that its first hitting time gives an upper bound on the first hitting time of $\log(\|m_t\|)$.

### 4.1 Potential Function

We fix two probabilities $p_\ell$ and $p_u$ such that $p_u < 1/5 < p_\ell$. Since the probability of success function $\bar\sigma \mapsto p^{0,d}_{\text{succ}}(\bar\sigma)$ with rate $0$ is bijective onto its image (see Lemma 1), we know that there exist $\ell$ and $u$ such that $p^{0,d}_{\text{succ}}(\ell) = p_\ell$ and $p^{0,d}_{\text{succ}}(u) = p_u$. Given a constant $\alpha > 1$, we assume that $p_\ell$ and $p_u$ are chosen such that $\alpha^{5/4} \cdot \ell < u$, so that the two penalty terms below cannot be positive simultaneously. Given these parameters, we define the potential function

$$V(\theta) = V(m, \sigma) = \log(\|m\|) + \max\left\{0,\ v \cdot \log\left(\frac{\alpha \cdot \ell \cdot \|m\|}{d \cdot \sigma}\right),\ v \cdot \log\left(\frac{\alpha^{1/4} \cdot \sigma \cdot d}{u \cdot \|m\|}\right)\right\} \tag{8}$$

with a coefficient $v > 0$ to be determined later. The potential function consists of three parts. The term $\log(\|m\|)$ measures optimization progress: when approaching the optimum, it decays to $-\infty$. The other terms become positive and hence active only if the step size is not well adapted. The second term in the maximum kicks in if the normalized step size $\bar\sigma = d \cdot \sigma / \|m\|$ is “too small” ($\bar\sigma < \alpha \ell$), and the third term turns positive if it becomes “too large” ($\bar\sigma > \alpha^{-1/4} u$). Hence the potential combines two ways of making progress, namely approaching the optimum and adapting the step size towards a regime where the (1+1)-ES can make significant optimization progress. The parameter $v$ relates these two types of progress by putting them on the same scale.
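For concreteness, the potential of eq. (8) can be evaluated as follows; the constants `ell`, `u`, `alpha`, and `v` are illustrative placeholders, not the values prescribed by the analysis.

```python
import math

def potential(m_norm, sigma, d, ell=0.5, u=2.0, alpha=1.1, v=0.05):
    """Potential V of eq. (8).  The constants ell, u, alpha, v are
    illustrative placeholders, not the values derived in the paper."""
    too_small = v * math.log(alpha * ell * m_norm / (d * sigma))
    too_large = v * math.log(alpha ** 0.25 * sigma * d / (u * m_norm))
    return math.log(m_norm) + max(0.0, too_small, too_large)

if __name__ == "__main__":
    # the penalty is inactive for alpha*ell <= d*sigma/||m|| <= alpha**-0.25 * u
    print(potential(1.0, 0.1, 10))     # well-adapted sigma: V = log||m|| = 0.0
    print(potential(1.0, 1e-4, 10))    # sigma far too small: positive penalty
```

With these placeholders the penalty vanishes exactly when the normalized step size $d\sigma/\|m\|$ lies in $[\alpha\ell,\ \alpha^{-1/4}u]$, so $V$ reduces to $\log\|m\|$ in the well-adapted regime.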

###### Lemma 3.

It holds $\mathbb{E}[|V(\theta_t)|] < \infty$; in other words, $V(\theta_t)$ is integrable for each $t \in \mathbb{N}$. Moreover, for all $A > 0$ the truncated process defined in equation (1) with $X_t = V(\theta_t)$ is integrable for each $t \in \mathbb{N}$.

### 4.2 Truncated Drift

In the following, we prove that $V$ satisfies the prerequisites of Theorem 1. First we prove a proposition offering a range of possible choices for the constants $A$ and $B$. We then show in Proposition 3 how to set those constants to obtain the right scaling of the hitting time with respect to the dimension $d$.

###### Proposition 2.

Consider optimization of the sphere function with the (1+1)-ES. If the parameters $v$, $\alpha$, and $A$ are suitably coupled, then the potential function defined in eq. (8) fulfills

$$\mathbb{E}\bigl[\max\{V(\theta_{t+1}) - V(\theta_t),\ -A\} \,\big|\, \theta_t\bigr] \le -B, \tag{9}$$

with

$$B = \min\left\{A \cdot p^* - \frac{5}{4} \cdot v \cdot \log(\alpha),\ \ v \cdot \log(\alpha) \cdot \frac{5 p_\ell - 1}{4},\ \ v \cdot \log(\alpha) \cdot \frac{1 - 5 p_u}{4}\right\} \tag{10}$$

where $p^* > 0$ is a probability specified in the proof.

The previous proposition is the core component establishing the drift of the truncated process. The next proposition shows how to arrange the parameters so that the speed of the drift scales as desired in the limit of large dimensions.

###### Proposition 3.

For an appropriate coupling of the constants, with $v$, $\alpha$, and $A$ chosen as functions of the dimension $d$, it holds $A \in \Theta(1)$ and $B \in \Theta(1/d)$.

Proposition 3 implies that the truncated drift (9) is of order $\Theta(1/d)$.

### 4.3 Hit-and-Run

A very general lower bound on the expected first hitting time was established by Jägersküpper. His argumentation in [8, Theorem 2] is based on the hit-and-run algorithm. Here we use a similar approach for proving the lower bound. In iteration $t$, given a mutation direction $z_t$ (with the notation of Algorithm 1), the hit-and-run algorithm selects the optimal step length while maintaining the direction, producing the offspring $x^*_t = m_t + \lambda^* \sigma_t z_t$ with $\lambda^* \in \operatorname{argmin}_{\lambda \in \mathbb{R}} \|m_t + \lambda \sigma_t z_t\|$. By construction, the progress of the hit-and-run method upper bounds the progress of the (1+1)-ES. Using the same realization of the Gaussian vector creating the offspring (see Algorithm 1), we indeed have:

$$\log(\|m_t\|) - \log(\|m_{t+1}\|) \le \log(\|m_t\|) - \log(\|x^*_t\|). \tag{11}$$

The log-progress of the hit-and-run method on the sphere is $\log(\|m_t\|) - \log(\|x^*_t\|)$. In the next lemma, we bound its expectation.

###### Lemma 4.

The expected log progress of the hit-and-run algorithm is upper bounded by $1/d$.

Using inequality (11) we find that the expected log progress of the (1+1)-ES is also upper bounded by $1/d$:

$$\mathbb{E}\bigl[\log(\|m_t\|) - \log(\|m_{t+1}\|) \,\big|\, \mathcal{F}_t\bigr] \le \frac{1}{d}. \tag{12}$$

### 4.4 Bounds on the First Hitting Time

Finally, all preparations are in place and we can reap the fruits of our labor, formulated in the following theorem. To this end, let $T_\epsilon$ be the first hitting time of $\left]-\infty, \log(\epsilon)\right]$ by the process $\log(\|m_t\|)$, where $m_t$ is defined in Algorithm 1.

###### Theorem 3.

The expected first hitting time of the (1+1)-ES starting from $\theta_0 = (m_0, \sigma_0)$ on the sphere function fulfills

$$\bigl(\log(\|m_0\|) - \log(\epsilon)\bigr) \cdot \frac{d}{4} - \frac12 \ \le\ \mathbb{E}[T_\epsilon] \ \le\ \frac{V(\theta_0) - \log(\epsilon) + A}{B}$$

with $V$ defined in eq. (8) and $B$ given in eq. (10). With the choice of constants $A$ and $B$ given in Proposition 3, it is hence of the form

$$\mathbb{E}[T_\epsilon] \in \Theta\bigl(\bigl(\log(\|m_0\|) + \log(1/\epsilon)\bigr) \cdot d\bigr). \tag{13}$$

The asymptotic form (13) of the expected first hitting time implies (i) that the process is akin to linear convergence, due to the term $\log(1/\epsilon)$, and (ii) a convergence rate of the form $\Theta(1/d)$, due to the factor $d$ in the expected hitting time.
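The $\Theta(d)$ scaling of the hitting time can also be observed empirically. The sketch below measures the first iteration at which $\|m_t\| \le \epsilon$ for a (1+1)-ES implementation with illustrative step-size update constants (not necessarily those of Algorithm 1).

```python
import math
import random

def hitting_time_es(d, eps=1e-3, rng=None, max_iters=200_000):
    """First t with ||m_t|| <= eps for a (1+1)-ES with a one-fifth success
    rule on the sphere (the update factors are illustrative choices)."""
    rng = rng or random.Random(4)
    m = [1.0 / math.sqrt(d)] * d           # ||m_0|| = 1
    sigma = 1.0 / d                        # well-adapted initial step size
    fm = sum(x * x for x in m)
    for t in range(max_iters):
        x = [mi + sigma * rng.gauss(0.0, 1.0) for mi in m]
        fx = sum(xi * xi for xi in x)
        if fx <= fm:
            m, fm = x, fx
            sigma *= 1.5 ** (1.0 / d)
        else:
            sigma *= 1.5 ** (-0.25 / d)
        if fm <= eps * eps:
            return t + 1
    return max_iters

if __name__ == "__main__":
    for d in (4, 8, 16):
        mean_t = sum(hitting_time_es(d, rng=random.Random(s))
                     for s in range(5)) / 5
        print(d, mean_t)   # grows roughly linearly with d
```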

## 5 Discussion and Conclusion

We have established the first non-asymptotic runtime bound for the first hitting time of the (1+1)-ES with one-fifth success rule (Algorithm 1) on the sphere function. Our proof is based on a global drift condition, a generic approach that has proven invaluable for the analysis of discrete algorithms. Our work shows that such approaches are a promising tool also for continuous domains. As usual in drift analysis, constructing the potential function and establishing drift conditions makes up the lion’s share of the efforts. In this sense, our drift theorems merely add convenience.

Establishing a drift condition is simpler in the stability analysis of the underlying Markov chain [2], since there drift is needed only outside a compact set, i.e., only for very small and very large normalized step size $\bar\sigma$. On the other hand, the present analysis is non-asymptotic and provides estimates of the convergence rate as a function of the problem dimension.

Jägersküpper established similar results already more than a decade ago, when drift analysis was only in its infancy. His results are hard to follow from a modern perspective. We improve on his work by proving non-asymptotic bounds for finite dimensions.

#### Acknowledgement

We gratefully acknowledge support by Dagstuhl seminar 17191 “Theory of Randomized Search Heuristics”. We would like to thank Per Kristian Lehre, Carsten Witt, and Johannes Lengler for valuable discussions and advice on drift theory.

## Appendix

###### Proof of Theorem 1.

We consider the truncated process $(Y^A_t)$ defined above and the stopped truncated process $(Z_t)$ defined as $Z_t = Y^A_{\min\{t, T\}}$ with $T := T^{Y^A}_\beta$. By construction it holds $Z_0 = x_0$ and $Z_t = Y^A_t$ for all $t \le T$. We will prove that

$$\mathbb{E}[Z_{t+1} \mid \mathcal{F}_t] \le Z_t - B \cdot 1\{T > t\}. \tag{14}$$

We start from

$$\mathbb{E}[Z_{t+1} \mid \mathcal{F}_t] = \mathbb{E}\bigl[Z_{t+1} 1\{T \le t\} \,\big|\, \mathcal{F}_t\bigr] + \mathbb{E}\bigl[Z_{t+1} 1\{T > t\} \,\big|\, \mathcal{F}_t\bigr] \tag{15}$$

and estimate the two terms separately:

$$\mathbb{E}\bigl[Z_{t+1} 1\{T \le t\} \,\big|\, \mathcal{F}_t\bigr] = Z_t 1\{T \le t\}, \tag{16}$$

where we have used that $Z_{t+1} = Z_t$ on $\{T \le t\}$, and that the event $\{T \le t\}$ is determined by $Y^A_0, \dots, Y^A_t$, so that $1\{T \le t\}$ and $1\{T > t\}$ are $\mathcal{F}_t$-measurable. Also

$$\mathbb{E}\bigl[Z_{t+1} 1\{T > t\} \,\big|\, \mathcal{F}_t\bigr] = \mathbb{E}\bigl[Y^A_{t+1} \,\big|\, \mathcal{F}_t\bigr] 1\{T > t\} \le (Y^A_t - B) 1\{T > t\} = (Z_t - B) 1\{T > t\}, \tag{17}$$

where we have again used that $1\{T > t\}$ is $\mathcal{F}_t$-measurable, together with the drift condition (4). Injecting (16) and (17) into (15), we end up with (14). Taking the expectation in (14), we deduce

$$\mathbb{E}[Z_{t+1}] \le \mathbb{E}[Z_t] - B \cdot \Pr[T > t]. \tag{18}$$

Following the same approach as [13, Theorem 1], since $T$ is a random variable taking values in $\mathbb{N}$, its expectation can be rewritten as $\mathbb{E}[T] = \sum_{t=0}^{\infty} \Pr[T > t]$, and thus it holds

$$B \cdot \mathbb{E}\bigl[T^{Y^A}_\beta\bigr] \xleftarrow[\ \tilde t \to \infty\ ]{} \sum_{t=0}^{\tilde t} B \cdot \Pr\bigl[T^{Y^A}_\beta > t\bigr] \le \sum_{t=0}^{\tilde t} \bigl(\mathbb{E}[Z_t] - \mathbb{E}[Z_{t+1}]\bigr) = \mathbb{E}[Z_0] - \mathbb{E}[Z_{\tilde t + 1}] = x_0 - \mathbb{E}[Z_{\tilde t + 1}]. \tag{19}$$

Since $Y^A_{t+1} - Y^A_t \ge -A$ and $Y^A_t > \beta$ for $t < T$, it holds $Y^A_T \ge \beta - A$, and therefore $Z_t \ge \beta - A$ for all $t$, which implies

$$\mathbb{E}\bigl[T^{Y^A}_\beta\bigr] \le \frac{x_0 - \beta + A}{B}.$$

With $\mathbb{E}[T^X_\beta] \le \mathbb{E}[T^{Y^A}_\beta]$, which follows from inequality (3), this proves the upper bound. ∎

###### Proof of Proposition 1.

From the definition (1) of the truncated process we obtain $|Y^A_{t+1}| \le |Y^A_t| + |X_{t+1} - X_t| + A$, which implies

$$\mathbb{E}\bigl[|Y^A_{t+1}|\bigr] \le \mathbb{E}\bigl[|Y^A_t|\bigr] + \mathbb{E}\bigl[|X_{t+1} - X_t|\bigr] + A \le \mathbb{E}\bigl[|Y^A_t|\bigr] + \mathbb{E}\bigl[|X_{t+1}|\bigr] + \mathbb{E}\bigl[|X_t|\bigr] + A,$$

where the second to fourth terms are finite. Since $Y^A_0 = X_0$ is integrable, $Y^A_t$ is integrable for every $t$ by induction. ∎

###### Proof of Theorem 2.

Set $T = \bigl\lfloor \frac{x_0 - \beta}{2C} \bigr\rfloor$. Since the drift is bounded from below by $-C$, after $t$ iterations it holds $\mathbb{E}[x_0 - X_t] \le C \cdot t$, and $x_0 - X_t \ge 0$ since the process is non-increasing. From Markov’s inequality we conclude $\Pr(T^X_\beta \le t) = \Pr(x_0 - X_t \ge x_0 - \beta) \le \frac{C t}{x_0 - \beta}$ and thus $\Pr(T^X_\beta \le T) \le 1/2$, which implies $\Pr(T^X_\beta \ge T) \ge 1/2$. Applying the Markov inequality once more we obtain

$$\mathbb{E}\bigl[T^X_\beta\bigr] \ge \Pr\bigl(T^X_\beta \ge T\bigr) \cdot T \ge \frac{T}{2} \ge \frac{x_0 - \beta}{4C} - \frac12. \qquad\qed$$

###### Proof of Lemma 1.

We introduce the sample $x \sim \mathcal{N}(m, \sigma^2 I)$ through $x = m + \sigma z$ with $z \sim \mathcal{N}(0, I)$, or equivalently, $z = (x - m)/\sigma$. Defining $A_{r,d}(\bar\sigma) = \bigl\{z \in \mathbb{R}^d \,\big|\, \|e_1 + \frac{\bar\sigma}{d} z\| < 1 - r\bigr\}$, we write the success rate in the form

$$p^{r,d}_{\text{succ}}(\bar\sigma) = \int_{A_{r,d}(\bar\sigma)} \varphi_d(z) \, dz.$$

For increasing values of $\bar\sigma$ the ball-shaped integration area $A_{r,d}(\bar\sigma)$ shrinks, and in the case $r > 0$ it also moves away from the origin. Together with the monotonicity of $\varphi_d$ w.r.t. $\|z\|$ this proves that $\bar\sigma \mapsto p^{r,d}_{\text{succ}}(\bar\sigma)$ is monotonically decreasing. Continuity follows from the boundedness of $\varphi_d$, and positivity from the fact that $A_{r,d}(\bar\sigma)$ is non-empty and open and $\varphi_d$ is positive. This proves the first claim. For $r = 0$ the balls are nested. This immediately proves the second claim. From

$$\bigcap_{\bar\sigma > 0} A_{0,d}(\bar\sigma) = \emptyset \qquad \text{and} \qquad \bigcup_{\bar\sigma > 0} A_{0,d}(\bar\sigma) = \bigl\{z \in \mathbb{R}^d \,\big|\, e_1^\top z < 0\bigr\}$$

we conclude $\lim_{\bar\sigma \to \infty} p^{0,d}_{\text{succ}}(\bar\sigma) = 0$ and $\lim_{\bar\sigma \to 0} p^{0,d}_{\text{succ}}(\bar\sigma) = 1/2$, which proves the last claim. ∎

###### Proof of Lemma 2.

With $r = \rho/d$, we consider the sequence of random variables

$$J_d = 1\Bigl\{\bigl\|e_1 + \tfrac{\bar\sigma}{d}\mathcal{N}\bigr\|^2 < (1 - r)^2\Bigr\} = 1\Bigl\{1 + \tfrac{2\bar\sigma}{d}\mathcal{N}_1 + \tfrac{\bar\sigma^2}{d^2}\|\mathcal{N}\|^2 < 1 - 2r + r^2\Bigr\} = 1\Bigl\{2\bar\sigma\mathcal{N}_1 + \tfrac{\bar\sigma^2}{d}\|\mathcal{N}\|^2 < -2rd + r^2 d\Bigr\}$$

indexed by $d$. Here $\mathcal{N}$ denotes a standard normally distributed vector in $\mathbb{R}^d$, and $\mathcal{N}_1$ is its first component. By the law of large numbers, $\frac{1}{d}\|\mathcal{N}\|^2$ converges almost surely to $1$, and since $rd = \rho$ and $r^2 d \to 0$, when $d$ goes to infinity it holds

$$\lim_{d \to \infty} 1\Bigl\{2\bar\sigma\mathcal{N}_1 + \tfrac{\bar\sigma^2}{d}\|\mathcal{N}\|^2 < -2rd + r^2 d\Bigr\} = 1\bigl\{2\bar\sigma\mathcal{N}_1 + \bar\sigma^2 < -2\rho\bigr\} = 1\Bigl\{\mathcal{N}_1 < -\tfrac{\rho}{\bar\sigma} - \tfrac{\bar\sigma}{2}\Bigr\}$$

almost surely. Since $\mathbb{E}[J_d] = p^{\rho/d,\,d}_{\text{succ}}(\bar\sigma)$ and $J_d$ converges almost surely to $1\{\mathcal{N}_1 < -\rho/\bar\sigma - \bar\sigma/2\}$, we need uniform integrability to ensure that the limit also holds in expectation. The uniform integrability is obvious here since $|J_d| \le 1$ for all $d$. Hence we have proven

$$\lim_{d \to \infty} \mathbb{E}[J_d] = \mathbb{E}\Bigl[1\Bigl\{\mathcal{N}_1 < -\tfrac{\rho}{\bar\sigma} - \tfrac{\bar\sigma}{2}\Bigr\}\Bigr] = \Pr\Bigl(\mathcal{N}_1 < -\tfrac{\rho}{\bar\sigma} - \tfrac{\bar\sigma}{2}\Bigr) = \Phi_1\Bigl[-\tfrac{\rho}{\bar\sigma} - \tfrac{\bar\sigma}{2}\Bigr]. \qquad\qed$$

###### Proof of Lemma 3.

The statement holds trivially for $t = 0$, since the initial state $\theta_0$ is a constant. The following elementary calculation shows that the pole of the logarithm in the definition of $V$ is not problematic. Let $B(0,1)$ denote the open ball of radius one around the origin and $S(0,r)$ the sphere of radius $r$; then we have:

$$\int_{B(0,1)} \log(\|z\|) \, dz = \int_0^1 \int_{S(0,r)} \log(\|z\|) \, dz \, dr = \int_0^1 \Bigl(\int_{S(0,r)} dz\Bigr) \log(r) \, dr = \frac{2\pi^{d/2}}{\Gamma(d/2)} \int_0^1 r^{d-1} \log(r) \, dr = \frac{2\pi^{d/2}}{\Gamma(d/2)} \cdot \left[\frac{r^d \bigl(d \log(r) - 1\bigr)}{d^2}\right]_0^1 = -\frac{2\pi^{d/2}}{\Gamma(d/2)} \cdot \frac{1}{d^2},$$

where $\Gamma$ denotes the Gamma function. Therefore $\mathbb{E}\bigl[|V(\theta_{t+1})| \,\big|\, \theta_t\bigr] < \infty$ for all states $\theta_t$, and the statement follows by induction. The integrability of the truncated process is straightforward from the above statement and Proposition 1. ∎

###### Proof of Proposition 2.

For the sake of simplicity we introduce the notation $\log^+(z) = \max\{0, \log(z)\}$ and the normalized step size $\bar\sigma_t = d \cdot \sigma_t / \|m_t\|$. We rewrite the potential function as

$$V(m_t, \sigma_t) = \log(\|m_t\|) + v \cdot \log^+\left(\frac{\alpha \cdot \ell \cdot \|m_t\|}{\sigma_t \cdot d}\right) + v \cdot \log^+\left(\frac{\sigma_t \cdot d}{\alpha^{-1/4} \cdot u \cdot \|m_t\|}\right). \tag{20–21}$$

We want to estimate the conditional expectation

$$\mathbb{E}\bigl[\max\{V(\theta_{t+1}) - V(\theta_t),\ -A\} \,\big|\, \theta_t\bigr]. \tag{22}$$

We partition the possible values of $\bar\sigma_t$ into three sets: first the set of $\bar\sigma_t$ such that $\bar\sigma_t \le \alpha \ell$ ($\sigma_t$ is small), second the set of $\bar\sigma_t$ such that $\bar\sigma_t \ge \alpha^{-1/4} u$ ($\sigma_t$ is large), and last the set of $\bar\sigma_t$ such that $\alpha \ell < \bar\sigma_t < \alpha^{-1/4} u$ (reasonable $\sigma_t$). In the following, we bound eq. (22) for each of the three cases, and in the end our bound will equal the minimum of the three bounds obtained for the individual cases.

Reasonable case: $\alpha \ell < \bar\sigma_t < \alpha^{-1/4} u$. The potential function at time $t+1$ can be written as

$$V(\theta_{t+1}) = \log(\|m_{t+1}\|) + v \cdot \log\left(\frac{\alpha \cdot \ell \cdot \|m_{t+1}\|}{d \cdot \sigma_{t+1}}\right) 1\{\alpha \ell \|m_{t+1}\| > d \sigma_{t+1}\} 1\{\sigma_{t+1} > \sigma_t\} + v \cdot \log\left(\frac{\alpha \cdot \ell \cdot \|m_{t+1}\|}{d \cdot \sigma_{t+1}}\right) 1\{\alpha \ell \|m_{t+1}\| > d \sigma_{t+1}\} 1\{\sigma_{t+1} < \sigma_t\} + v \cdot \log\left(\frac{d \cdot \alpha^{1/4} \cdot \sigma_{t+1}}{u \cdot \|m_{t+1}\|}\right) 1\{\alpha^{-1/4} u \|m_{t+1}\| < d \sigma_{t+1}\}$$