 # Average Convergence Rate of Evolutionary Algorithms II: Continuous Optimization

A good convergence metric must satisfy two requirements: feasible in calculation and rigorous in analysis. The average convergence rate is proposed as a new measurement for evaluating the convergence speed of evolutionary algorithms over consecutive generations. Its calculation is simple in practice and it is applicable to both continuous and discrete optimization. Previously, a theoretical study of the average convergence rate was conducted for discrete optimization. This paper extends the analysis to continuous optimization. First, the strategies of generating new solutions are classified into two categories: landscape-invariant and landscape-adaptive. Then, it is proven that the average convergence rate of evolutionary algorithms using landscape-invariant generators converges to zero, while the rate of algorithms using positive-adaptive generators has a positive limit. Finally, two case studies, the minimization problems of the two-dimensional sphere function and Rastrigin function, are presented to demonstrate the applicability of the theory.


## I Introduction

In the theoretical study of EAs, a fundamental question is how fast an EA can find an optimal solution to a problem. In discrete optimization, this can be measured by the number of generations (hitting time) or the number of fitness evaluations (running time) until an EA finds an optimal solution [1, 2]. However, computation time is seldom applied to continuous optimization. Unlike discrete optimization, computation time is normally infinite in continuous optimization because the optimal solution set of a continuous optimization problem is usually a zero-measure set. In order to apply computation time to continuous optimization, the optimal solution must be replaced by an ε-neighbourhood of the optimal solution set [3, 4, 5], which forms a positive-measure set.

In continuous optimization, the performance of EAs is often evaluated by the convergence rate. Informally, the convergence rate question is: how fast does d(X_t, X∗) converge to 0, where d(X_t, X∗) is a distance between the t-th generation population X_t and the optimal solution set X∗? A lot of theoretical work has discussed this topic from different perspectives [6, 7, 8, 9, 10, 11]; however, convergence metrics studied in theory are seldom adopted in practice. This motivates us to design a practical convergence metric satisfying two requirements: feasible in calculation and rigorous in theory.

Our work emphasizes the convergence rate in terms of the approximation error, which evaluates the solution quality of EAs. Let f(X_t) denote the fitness of the best individual in population X_t, f_t its expected value E[f(X_t)], and f∗ the fitness of the optimal solution. The approximation error e_t is |f_t − f∗|. In the context of e_t, the convergence rate question is: how fast does e_t converge to 0? It is straightforward to derive the geometric convergence e_t ≤ c^t e_0 from the condition e_t ≤ c e_{t−1} for some c ∈ (0, 1).

An alternative convergence metric is the error ratio between two generations (or one-generation convergence rate): e_t/e_{t−1}. This ratio works well in deterministic iterative algorithms. But unfortunately, it is not appropriate for EAs because its calculation is numerically unstable.

A remedy to the deficiency of the two-generation error ratio is to consider its average over consecutive generations. The geometric average convergence rate (ACR) proposed by He and Lin is

 R_t = 1 − (e_t/e_0)^{1/t}. (1)

From the ACR, it is straightforward to derive an exact expression of the approximation error: e_t = (1 − R_t)^t e_0. More importantly, the calculation of R_t is more stable than that of e_t/e_{t−1} in computer simulation.
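The identity e_t = (1 − R_t)^t e_0 can be checked mechanically. A minimal Python sketch (the error sequence below is illustrative, not taken from any experiment in this paper):

```python
import math

def acr(errors):
    """Geometric average convergence rate R_t = 1 - (e_t / e_0)**(1/t),
    computed from a sequence of approximation errors e_0, ..., e_t."""
    t = len(errors) - 1
    e0, et = errors[0], errors[-1]
    if et == 0.0:
        return 1.0  # convention: R_t = 1 once the error hits zero
    return 1.0 - (et / e0) ** (1.0 / t)

# Any positive error sequence satisfies e_t = (1 - R_t)**t * e_0 exactly.
errors = [8.0, 3.5, 1.2, 0.9, 0.25]
t = len(errors) - 1
R = acr(errors)
assert math.isclose((1.0 - R) ** t * errors[0], errors[-1])
```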

For discrete optimization, it has been proven that, under random initialization, R_t converges to a positive limit, and under particular initialization, R_t always equals this positive limit.

The current paper extends the analysis of the ACR from discrete optimization to continuous optimization. However, the extension is not trivial due to completely different probability measures in discrete and continuous spaces. There are two essential changes in the extension.

The analyses are different. In continuous optimization, an EA is modelled by a Markov chain in a continuous state space, rather than a Markov chain in a finite state space. Thus the matrix analysis used in the discrete case cannot be applied to continuous optimization.

The results are different. For continuous optimization, Theorem 1 in this paper claims that, given a convergent EA modelled by a homogeneous Markov chain, its ACR converges to 0 if its generator is landscape-invariant, or converges to a positive limit if its generator is positive-adaptive. But for discrete optimization, the corresponding theorem states that for all convergent EAs modelled by homogeneous Markov chains, their ACR converges to a positive limit.

The paper is organized as follows: Section II introduces the related work. Section III defines the ACR. Section IV provides a general analysis of the ACR. Section V provides two case studies on the sphere function and Rastrigin function. Section VI concludes the paper.

## II Related Work

The convergence rate of EAs has been investigated from different perspectives and in varied terms.

Rudolph proved that, under a contraction condition on the expected error, the sequence e_t converges in mean geometrically fast to 0, that is, e_t = O(c^t) for some c ∈ (0, 1). For a superset of the class of quadratic functions, sharp bounds on the convergence rate were obtained.

Rudolph compared Gaussian and Cauchy mutations on minimizing the sphere function in terms of the rate of local convergence, where distances are measured in the Euclidean norm ‖·‖. He proved the rate is identical for Gaussian and spherical Cauchy distributions, whereas nonspherical Cauchy mutations lead to slower local convergence.

Beyer developed a systematic theory of evolution strategies (ES) based on the progress rate and the quality gain. The progress rate measures the expected change of the distance to the optimal solution in one generation, while the quality gain is the expected fitness change in one generation, measured with respect to the fitness mean of the individuals in a population. Recently, Beyer et al. [15, 16] analyzed the dynamics of ES with cumulative step-size adaptation and ES with self-adaptation and multi-recombination on the ellipsoid model and derived the quadratic progress rate. Akimoto et al. investigated evolution strategies with weighted recombination on general convex quadratic functions and derived the asymptotic quality gain. However, Auger and Hansen argued the limits of predictions based on the progress rate.

Auger and Hansen developed the theory of ES from a new perspective using the stability of Markov chains. Auger investigated a self-adaptive ES on the sphere function and proved its convergence based on Foster-Lyapunov drift conditions. Jebalia et al. investigated the convergence rate of the scale-invariant (1+1)-ES in minimizing the noisy sphere function and proved a log-linear convergence rate in the sense that (1/t) log ‖X_t‖ converges to a negative constant as t → +∞. Auger and Hansen further investigated comparison-based step-size adaptive randomized search on scaling-invariant objective functions and proved the same type of log-linear convergence. This log-linear convergence is an extension of the average rate of convergence in deterministic iterative methods.

He, Kang and Ding [8, 22] studied the convergence in distribution, where π_t denotes the probability distribution of X_t and π a stationary probability distribution. Based on the Doeblin condition, they obtained bounds on the distance ‖π_t − π‖. He and Yu also derived lower and upper bounds on the convergence rate of the probability of X_t entering an ε-neighbourhood of the optimal set.

This paper develops Rudolph's early work, which showed the geometric convergence of e_t but didn't provide a method to quantify the convergence rate. We take R_t as a practical metric to measure the geometric convergence and make a rigorous analysis.

## III Definitions and Practical Usage

### III-A Definitions

A continuous minimization problem is to

 min f(→x), →x = (x_1, ⋯, x_d) ∈ D ⊂ R^d, (2)

where f(→x) is a continuous function defined on a closed set D. Denote the minimal fitness value by f∗ and the optimal solution set by X∗ = {→x ∈ D | f(→x) = f∗}. We assume the optimal solution set to the above problem is a finite set.

An individual →x is a vector in D and a population X is a tuple of individuals. A general framework of elitist EAs for solving optimization problems is described in Algorithm 1. Two types of genetic operators are employed in the algorithm. One is the generation operator, which generates new individuals from a population, such as mutation or crossover. The other is the selection operator, which selects individuals from a population. Any non-elitist EA can be modified into an equivalent elitist EA by adding an archive individual which preserves the best found solution but does not get involved in evolution. Hereafter we only consider elitist EAs.

Since population X_{t+1} in Algorithm 1 only depends on X_t, the population sequence {X_t, t = 0, 1, …} is a Markov chain [8, 9].

###### Definition 1

The fitness of population X is f(X) = min{f(→x) | →x ∈ X} and the approximation error of X_t is e(X_t) = f(X_t) − f∗. The sequence {e(X_t)} is called convergent in mean if lim_{t→+∞} E[e(X_t)] = 0 and convergent almost surely if Pr(lim_{t→+∞} e(X_t) = 0) = 1.

Thanks to elitist selection, e(X_{t+1}) ≤ e(X_t). Then the sequence {e(X_t)} is a supermartingale. According to Doob's convergence theorem, for elitist EAs, convergence in mean implies almost sure convergence.

###### Lemma 1

For elitist EAs, if the sequence {e(X_t)} converges in mean, then it converges almost surely.

The ACR evaluates the average convergence speed of EAs over t consecutive generations. The following definition is applicable to both elitist and non-elitist EAs.

###### Definition 2

Let e_t = E[e(X_t)] and assume e_0 > 0. The geometric average convergence rate (ACR) of an EA for t generations is

 R_t = 1 − (e_t/e_0)^{1/t} = 1 − (∏_{k=1}^{t} e_k/e_{k−1})^{1/t}. (3)

If e_k = 0 for some k, let R_t = 1 for any t ≥ k.

In (3), the term (∏_{k=1}^{t} e_k/e_{k−1})^{1/t} represents a geometric average of the reduction factor e_k/e_{k−1} over t generations. Subtracting it from 1 normalizes the average so that R_t lies in (−∞, 1]. The ACR can be regarded as the speed of convergence, while the error e_t is the distance from the optimal set. If R_t > 0, then the speed is positive and e_t < e_0; if R_t = 0, then the speed is zero and e_t = e_0; if R_t < 0 (which never happens in elitist EAs), then the speed is negative and e_t > e_0. Like the speed of light, the speed of convergence has an upper limit, that is, R_t ≤ 1.

### III-B Practical Usage of Average Convergence Rate

The ACR provides a simple method to numerically measure how fast an EA converges. This is the main purpose of the ACR. In practice, the expected value e_t is replaced by a sample mean over T runs of the EA. The ACR is calculated in four steps:

1. run an EA for T times;

2. calculate the fitness sample mean f_t^{[T]}:

 f_t^{[T]} = (1/T)(f(X_t^{[1]}) + ⋯ + f(X_t^{[T]})), (4)

where f(X_t^{[k]}) denotes the fitness at the k-th run;

3. calculate the approximation error: e_t^{[T]} = f_t^{[T]} − f∗;

4. finally, calculate the ACR: R_t^{[T]} = 1 − (e_t^{[T]}/e_0^{[T]})^{1/t}.

According to the Law of Large Numbers, it holds that e_t^{[T]} → e_t and R_t^{[T]} → R_t as T → +∞.
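The four steps can be sketched in Python. The sketch assumes a caller-supplied `run_ea` returning one run's best-so-far fitness trajectory; the function name and signature are illustrative, not from the paper:

```python
def estimate_acr(run_ea, f_star, T, t_max):
    """Estimate the ACR from T independent runs of an EA.
    Step 1: run the EA T times; step 2: fitness sample mean;
    step 3: approximation error; step 4: R_t = 1 - (e_t/e_0)**(1/t)."""
    trajectories = [run_ea(t_max) for _ in range(T)]           # step 1
    f_mean = [sum(traj[t] for traj in trajectories) / T        # step 2
              for t in range(t_max + 1)]
    e = [ft - f_star for ft in f_mean]                         # step 3
    return [1.0 - (e[t] / e[0]) ** (1.0 / t)                   # step 4
            for t in range(1, t_max + 1)]

# Sanity check on an idealized "EA" whose error halves each generation:
halving = lambda t_max: [0.5 ** t for t in range(t_max + 1)]
R = estimate_acr(halving, f_star=0.0, T=5, t_max=10)
```

For the idealized halving run, every R_t equals 1 − 1/2 = 0.5, matching the geometric average of the reduction factors.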

An example is given to show the usage of the ACR in computer simulation. The aim is a comparison of two EAs on two benchmark functions in terms of the ACR. The benchmarks are the 2-dimensional sphere and Rastrigin functions:

 min f_S(→x) = x_1^2 + x_2^2, →x ∈ R^2, (5)
 min f_R(→x) = 20 + ∑_{k=1}^{2} (x_k^2 − 10 cos 2πx_k), →x ∈ R^2. (6)

The minimal point of both functions is →x∗ = (0, 0) with f∗ = 0. Two EAs are variants of the (1+1) elitist EA (Algorithm 2) which adopt Gaussian mutation:

 →y = →x + →z, (7)

where →x is the parent, →y the child and →z a Gaussian random vector obeying the probability distribution

 z_i ∼ N(0, σ_i). (8)

There are two ways to set the variance σ_i.

• Invariant-σ: σ_i is set to a constant for all →x.

• Adaptive-σ: σ_i takes varied values on different →x, e.g., proportional to the distance from →x to the optimum.

In terms of naming, the EA using invariant-σ mutation is called an invariant EA and the EA using adaptive-σ mutation is called an adaptive EA.

In the experiment, an EA starts from a fixed initial solution and is run T times, each run lasting a fixed maximum number of generations.

The ACR quantifies the speed of convergence. Table I shows that the ACR value of the adaptive EA is much larger than that of the invariant EA on both f_S and f_R.

Fig. 1 illustrates the trend of R_t. The ACR of the adaptive EA tends to stabilize at some positive value, while the ACR of the invariant EA shows a decreasing tendency. This phenomenon will be rigorously analyzed later.
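The comparison can be reproduced in outline with the following Python sketch of a (1+1) elitist EA on the sphere function f_S. The initial point, run length, and the particular adaptive rule σ = ‖→x‖ are illustrative choices, not the exact settings of the paper's experiment:

```python
import math, random

def one_plus_one_ea(sigma_rule, x0, t_max, rng):
    """(1+1) elitist EA with Gaussian mutation y = x + z on the sphere
    function; sigma_rule(x) returns the mutation strength at parent x."""
    x = list(x0)
    best = [sum(xi * xi for xi in x)]                  # f_S(x)
    for _ in range(t_max):
        s = sigma_rule(x)
        y = [xi + rng.gauss(0.0, s) for xi in x]
        if sum(yi * yi for yi in y) < sum(xi * xi for xi in x):
            x = y                                      # elitist selection
        best.append(sum(xi * xi for xi in x))
    return best

def acr(e0, et, t):
    return 1.0 - (et / e0) ** (1.0 / t)

rng, t_max, x0 = random.Random(1), 200, (3.0, 3.0)
invariant = one_plus_one_ea(lambda x: 1.0, x0, t_max, rng)             # constant sigma
adaptive = one_plus_one_ea(lambda x: math.hypot(*x), x0, t_max, rng)   # sigma = ||x||
R_inv = acr(invariant[0], invariant[-1], t_max)
R_ada = acr(adaptive[0], adaptive[-1], t_max)
```

Elitism makes both error trajectories non-increasing, so both ACR values lie in [0, 1]; the adaptive rule typically yields the visibly larger one, mirroring Fig. 1.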

### III-C Discussion of Other Convergence Metrics

A good convergence metric should satisfy two requirements: feasible in calculation and rigorous in analysis. We discuss two common convergence metrics and show they don’t satisfy the requirements.

The ratio e_t/e_{t−1} is a popular convergence metric used in deterministic iterative algorithms, which quantifies the reduction of e_t in one iteration. Fig. 2 illustrates the value of this ratio for the adaptive EA on f_S. The ratio fluctuates greatly. Its calculation is sensitive and unstable because it divides two noisy sample means of nearly equal size. Therefore, it is not a practical metric to measure the convergence rate of EAs.

The logarithmic scale, log e_t, is probably the most widely used convergence metric for comparing the convergence speed of EAs in practice. Fig. 3 displays the value of log e_t for the adaptive and invariant EAs on f_S. When using log e_t for comparing the speed of convergence of two EAs, it is necessary to visualize it in a figure and compare the slope of log e_t by observation. Fig. 3 shows that the slope of log e_t of the adaptive EA is steeper than that of the invariant EA. However, an observation is an observation, not an analysis. The slope itself might be taken as a convergence metric. But like e_t/e_{t−1}, its calculation is sensitive and unstable in computer simulation.

Summarizing the above discussion, we conclude that neither e_t/e_{t−1} nor log e_t is appropriate as a convergence metric.
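The instability claim can be illustrated numerically. The sketch below feeds both metrics the same noisy estimates of a geometrically decaying error (the decay factor 0.9 and the ±20% multiplicative noise are arbitrary illustrative choices):

```python
import random, statistics

rng = random.Random(0)
# noisy estimates of an error sequence decaying like 0.9**t
e = [0.9 ** t * (1.0 + 0.2 * rng.uniform(-1.0, 1.0)) for t in range(50)]

ratios = [e[t] / e[t - 1] for t in range(1, len(e))]                 # one-generation ratio
acrs = [1.0 - (e[t] / e[0]) ** (1.0 / t) for t in range(1, len(e))]  # ACR

# The one-generation ratio inherits the full sampling noise, while the
# 1/t exponent in the ACR damps it, so the ACR sequence is far smoother.
```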

## IV General Analyses

### IV-A Transition Probabilities

An EA is determined by its genetic operators: generation and selection. In mathematics, both can be represented by transition probabilities.

Let S denote the set consisting of all populations. A population is represented by a capital letter such as X. The t-th generation population is represented by X_t, which is a random vector. A population X satisfying e(X) = 0 is called an optimal population, and the collection of all optimal populations is denoted as X∗.

Given a contraction factor ρ ∈ (0, 1] and a population X, the set S can be divided into two disjoint subsets:

 S(X, ρ) = {Y ∈ S | e(Y) < ρ e(X)}, (9)
 S̄(X, ρ) = {Y ∈ S | e(Y) ≥ ρ e(X)}. (10)

The set S(X, ρ) is called a ρ-promising region; especially, when ρ = 1, the set S(X, 1) is called a promising region.

The generation of new individuals via the generation operator, X → Y, can be characterized by a probability transition. Given a population X and a population set A, the transition probability kernel is defined as

 P_g(X; A) = ∫_A p_g(X; Y) dY,

where p_g(X; Y) is a transition probability density function.

Similarly, the selection operation, (X, Y) → Z, can be described by a probability transition too. Given any populations X, Y and a population set A, its transition probability kernel is defined as

 P_s(X, Y; A) = ∫_A p_s(X, Y; Z) dZ,

where p_s(X, Y; Z) is a transition probability density function.

A one-generation update of the population, X_t → X_{t+1}, is described by a probability transition. Given any population X and a population set A, its transition probability kernel is defined as

 P(X; A) = ∫_A p(X; Y) dY,

where p(X; Y) is a transition probability density function.

Generally, the operators of generating new individuals may be classified into two categories.

###### Definition 3

Let p_g(X; Y) be the probability density function depicting the generation transition from X to Y.

1. Landscape-invariant: a generator is called landscape-invariant if Y = X + Z and Z is a multivariate random variable whose joint probability distribution is independent of X. Here X + Z represents adding a random perturbation to each individual in X. We assume the density function of Z is continuous and bounded, such as for Cauchy and Gaussian distributions.

2. Landscape-adaptive: a generator is called landscape-adaptive if it is not landscape-invariant.

A landscape-invariant generator generates candidate solutions subject to the same probability distribution no matter where a parent population is located. An example is the invariant-σ Gaussian mutation described in Algorithm 2. A landscape-adaptive generator adjusts the probability distribution according to the position of a parent population. An example is the adaptive-σ Gaussian mutation in Algorithm 2.

For the landscape-invariant generator, the lemma below states that the infimum of the transition probability to the promising region equals zero.

###### Lemma 2

If the number of optimal solutions is finite and the generator is landscape-invariant, then the transition probability to the promising region satisfies

 inf{P_g(X; S(X, 1)); X ∉ X∗} = 0, (11)

where inf is the abbreviation of mathematical infimum.

###### Proof:

In order to prove (11), it is sufficient to prove

 lim_{e(X)→0} P_g(X; S(X, 1)) = 0. (12)

That is, ∀ε > 0, ∃δ > 0 such that ∀X with 0 < e(X) < δ, it holds

 P_g(X; S(X, 1)) < ε. (13)

For a Lebesgue-measurable set A, let m(A) denote its Lebesgue measure. Because the density function of Z is continuous and bounded, the probability of X + Z falling in a small area is small (where X is fixed but Z is random). More strictly, ∀ε > 0, ∃δ_1 > 0 such that for any set A with m(A) < δ_1 and any X, it holds

 Pr(X + Z ∈ A) = ∫_{Z: X+Z∈A} p_z(X + Z) dZ < ε. (14)

Because the number of optimal solutions is finite (then m(X∗) = 0), for the δ-neighbourhood A(X∗, δ) of X∗ we may choose δ sufficiently small so that m(A(X∗, δ)) < δ_1.

Because f is continuous, for X with e(X) sufficiently small, every Y ∈ S(X, 1) lies in A(X∗, δ). This implies the promising region S(X, 1) ⊂ A(X∗, δ).

According to (14) and m(A(X∗, δ)) < δ_1, we have

 Pr(X + Z ∈ A(X∗, δ)) < ε. (15)

Because S(X, 1) ⊂ A(X∗, δ), we have

 P_g(X; S(X, 1)) ≤ Pr(X + Z ∈ A(X∗, δ)) < ε. (16)

The above inequality is the desired result.

### IV-B Analysis of Landscape-invariant Generators

For elitist EAs using landscape-invariant generators, Theorem 1 below indicates that the limit of the ACR is 0.

###### Theorem 1

For Problem (2) and Algorithm 1, if the following conditions are true:

1. the number of optimal solutions is finite;

2. the sequence {e_t} converges to 0;

3. the generator is landscape-invariant;

then lim_{t→+∞} R_t = 0.

###### Proof:

In order to prove lim_{t→+∞} R_t = 0, it is sufficient to prove that lim_{t→+∞} e_t/e_{t−1} = 1, or equivalently, lim_{t→+∞} (e_{t−1} − e_t)/e_{t−1} = 0. According to the definition of the limit, it is sufficient to prove that ∀ε > 0, ∃t_0 such that ∀t ≥ t_0,

 e_{t−1} − e_t < ε e_{t−1}. (17)

From (13) in Lemma 2, we know that ∀ε > 0, ∃δ > 0 such that for any X with 0 < e(X) < δ, it holds

 P_g(X; S(X, 1)) < ε. (18)

From Lemma 1, the sequence {e(X_t)} converges almost surely to 0, that is, Pr(lim_{t→+∞} e(X_t) = 0) = 1. Denote

 S_1 = {ω | lim_{t→+∞} e(X_t(ω)) = 0},
 S_2 = {ω | lim_{t→+∞} e(X_t(ω)) ≠ 0}.

For the set S_2, it holds

 Pr(ω ∈ S_2) = 0, (19)

and for the set S_1, we know that for the given δ, ∃t_0 such that ∀t ≥ t_0, it holds

 e(X_{t−1}(ω)) < δ, ∀ω ∈ S_1.

From (18) we know

 P_g(X_{t−1}(ω); S(X_{t−1}(ω), 1)) < ε, ω ∈ S_1.

Then we obtain, ∀ω ∈ S_1,

 E[e(X_{t−1}(ω)) − e(X_t(ω)) ∣ X_{t−1}(ω)] ≤ ε e(X_{t−1}(ω)). (20)

Since the error e(X_t) is bounded under elitist selection, there exists a positive constant B such that

 E[e(X_{t−1}(ω)) − e(X_t(ω)) ∣ X_{t−1}(ω)] ≤ B. (21)

Combining (19), (20) and (21) together, we get

 e_{t−1} − e_t = ∫_{S_1} E[e(X_{t−1}(ω)) − e(X_t(ω)) ∣ X_{t−1}(ω)] Pr(dω)
 + ∫_{S_2} E[e(X_{t−1}(ω)) − e(X_t(ω)) ∣ X_{t−1}(ω)] Pr(dω)
 ≤ ε ∫_{S_1} e(X_{t−1}(ω)) Pr(dω) + B · 0
 ≤ ε e_{t−1}.

So (17) is true, which completes the proof.

Theorem 1 states that for EAs using landscape-invariant generators, the limit of their ACR is 0 as t → +∞. This implies that landscape-invariant generators are not appropriate for solving continuous optimization problems.

Theorem 1 may not hold if the Lebesgue measure of X∗ is positive. However, for most continuous optimization problems, X∗ is a zero-measure set.

### IV-C Analysis of Landscape-adaptive Generators

Landscape-adaptive generators can be split into two types:

1. positive-adaptive: a landscape-adaptive generator is called positive-adaptive if, for some ρ ∈ (0, 1), the transition probability to the ρ-promising region satisfies

 C_ρ = inf{P_g(X; S(X, ρ)); X ∉ X∗} > 0. (22)

2. zero-adaptive: a landscape-adaptive generator is called zero-adaptive if the transition probability to the promising region satisfies

 inf{P_g(X; S(X, 1)); X ∉ X∗} = 0. (23)

A zero-adaptive generator does not guarantee a positive limit of the ACR. There are two cases.

1. lim_{e(X)→0} P_g(X; S(X, 1)) = 0. The analysis of this case is similar to Theorem 1. Then lim_{t→+∞} R_t = 0.

2. ∃X ∉ X∗ such that P_g(X; S(X, 1)) = 0. When an EA starts from X, e(X_t) = e(X) for all t and then R_t = 0.

However, a positive-adaptive generator always provides good adaptation because it ensures that the limit of the ACR is positive.

###### Theorem 2

For Problem (2) and Algorithm 1, if the following conditions are true:

1. the sequence {e_t} converges to 0;

2. the generation operator is positive-adaptive with a contraction factor ρ ∈ (0, 1);

then there exists a constant C > 0 such that R_t ≥ C for all t ≥ 1.

###### Proof:

From (9), we know that for any X_{k−1},

 S(X_{k−1}, ρ) = {Y ∈ S ∣ e(Y) < ρ e(X_{k−1})}.

It follows that, for any Y ∈ S(X_{k−1}, ρ),

 f(X_{k−1}) − f(Y) ≥ (1 − ρ)(f(X_{k−1}) − f∗). (24)

So we get

 E[f(X_{k−1}) − f(X_k) | X_{k−1}]
 = ∫_{S(X_{k−1},1)} (f(X_{k−1}) − f(Y)) p_g(X_{k−1}; Y) dY
 ≥ ∫_{S(X_{k−1},ρ)} (f(X_{k−1}) − f(Y)) p_g(X_{k−1}; Y) dY
 ≥ ∫_{S(X_{k−1},ρ)} (1 − ρ)(f(X_{k−1}) − f∗) p_g(X_{k−1}; Y) dY (from (24))
 = (1 − ρ)(f(X_{k−1}) − f∗) P_g(X_{k−1}; S(X_{k−1}, ρ))
 ≥ (1 − ρ) C_ρ (f(X_{k−1}) − f∗). (from (22)) (25)

Then

 e_k/e_{k−1} = 1 − (f_{k−1} − f_k)/(f_{k−1} − f∗)
 = 1 − E[E[f(X_{k−1}) − f(X_k) | X_{k−1}]]/(f_{k−1} − f∗)
 ≤ 1 − E[(1 − ρ) C_ρ (f(X_{k−1}) − f∗)]/(f_{k−1} − f∗)
 ≤ 1 − (1 − ρ) C_ρ.

Then,

 R_t = 1 − (e_t/e_0)^{1/t} = 1 − (∏_{k=1}^{t} e_k/e_{k−1})^{1/t} ≥ (1 − ρ) C_ρ.

Let C = (1 − ρ) C_ρ. It holds that R_t ≥ C for all t ≥ 1, which completes the proof.

Theorem 2 indicates that if an EA employs a positive-adaptive generator, then it converges to the optimal set with a positive ACR. How to design a generator satisfying the positive-adaptive condition (22) is important. An example is Rechenberg's 1/5th success rule for controlling the mutation strength used in evolution strategies. From a theoretical viewpoint, Theorems 1 and 2 together confirm the necessity of using adaptive generators in continuous optimization.
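As an illustration, here is a minimal Python sketch of a (1+1)-ES with the 1/5th success rule on the sphere function. The adaptation factor 0.85, the window of 10 generations, and the other settings are common illustrative choices, not prescribed by this paper:

```python
import random

def one_fifth_rule_es(f, x0, sigma0, t_max, rng, a=0.85, window=10):
    """(1+1)-ES with Rechenberg's 1/5th success rule: every `window`
    generations, enlarge sigma if more than 1/5 of the mutations
    succeeded, shrink it otherwise."""
    x, sigma, successes = list(x0), sigma0, 0
    for t in range(1, t_max + 1):
        y = [xi + rng.gauss(0.0, sigma) for xi in x]
        if f(y) < f(x):                       # elitist selection
            x, successes = y, successes + 1
        if t % window == 0:
            sigma = sigma / a if successes > window / 5 else sigma * a
            successes = 0
    return x, sigma

sphere = lambda x: sum(xi * xi for xi in x)
x, sigma = one_fifth_rule_es(sphere, [3.0, 3.0], 1.0, 500, random.Random(2))
```

Keeping the success probability near a constant 1/5 keeps σ roughly proportional to the distance to the optimum, which is the positive-adaptive behaviour required by (22).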

### IV-D Analysis of Elitist EAs Not Convergent in Mean to 0

The analysis of this kind of EAs is rather simple. The theorem below states that the limit of the ACR is 0.

###### Theorem 3

If the sequence {e_t} does not converge to 0, then lim_{t→+∞} R_t = 0.

###### Proof:

Due to elitist selection, the sequence {e_t} is monotonically decreasing and bounded below by 0. According to the monotone convergence theorem, the limit lim_{t→+∞} e_t exists. The condition says this limit is not 0, thus lim_{t→+∞} e_t = e_∞ > 0. Then lim_{t→+∞} R_t = 1 − lim_{t→+∞} (e_t/e_0)^{1/t} = 1 − 1 = 0.

## V Case Studies

### V-A 2-D Sphere Function

Consider minimization of the 2-dimensional (2-D) sphere function.

 min f_S(→x) = x_1^2 + x_2^2, →x = (x_1, x_2) ∈ R^2. (26)

The optimal solution is →x∗ = (0, 0) with f∗ = 0.

The (1+1) elitist EA (Algorithm 2) is used to solve this problem. Let →x be the individual at the t-th generation and →y its child generated by the Gaussian mutation (7).

Since the mutation obeys the Gaussian probability distribution (8), its probability density function is

 p_g(→x; →y) = (1/(2πσ_1σ_2)) exp{−(y_1 − x_1)^2/(2σ_1^2) − (y_2 − x_2)^2/(2σ_2^2)}. (27)

Recalling that the sphere function is symmetric about the origin of coordinates, we set

 σ_1 = σ_2 = σ.

Since the selection is elitist, the parent →x can be replaced by a child →y only if →y falls in the promising region S(→x, 1). For problem (26), the promising region is the disk centred at the origin with radius r = ‖→x‖. So,

 P_g(→x; S(→x, 1)) = (1/(2πσ^2)) ∫_{→y ∈ S(→x,1)} exp{−∑_{i=1}^{2} (y_i − x_i)^2/(2σ^2)} dy_1 dy_2
 = (1/(2πσ^2)) ∫_{−π/2}^{π/2} dθ ∫_{0}^{2r cos θ} u exp(−u^2/(2σ^2)) du
 = 1/2 − (1/π) exp(−2r^2/σ^2) ∫_{0}^{π/2} exp(2r^2 sin^2 θ/σ^2) dθ, (28)

where (u, θ) are polar coordinates centred at →x.

If σ is a constant, then the mutation is landscape-invariant. When the (1+1) EA converges to the optimal solution, the radius r converges to 0. As a result, the value of (28) also converges to 0 since σ is a constant. This means that (12) in Lemma 2 is true. According to Theorem 1, R_t converges to 0 when t → +∞.
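Formula (28) can be cross-checked numerically. The sketch below evaluates the last line of (28) by a midpoint Riemann sum and compares it with a Monte Carlo estimate; by rotational symmetry the parent is placed at (r, 0) without loss of generality (the sample sizes are arbitrary):

```python
import math, random

def p_promising_closed(r, sigma, n=20000):
    """Last line of (28): 1/2 - (1/pi) e^(-2r^2/s^2) * ∫ e^(2r^2 sin^2θ/s^2) dθ,
    with the integral over [0, pi/2] approximated by a midpoint sum."""
    h = (math.pi / 2.0) / n
    integral = sum(
        math.exp(2.0 * r * r * math.sin((k + 0.5) * h) ** 2 / sigma ** 2) * h
        for k in range(n))
    return 0.5 - math.exp(-2.0 * r * r / sigma ** 2) * integral / math.pi

def p_promising_mc(r, sigma, rng, n=200000):
    """Fraction of Gaussian mutations from (r, 0) landing closer to the origin."""
    hits = 0
    for _ in range(n):
        y1 = r + rng.gauss(0.0, sigma)
        y2 = rng.gauss(0.0, sigma)
        hits += (y1 * y1 + y2 * y2 < r * r)
    return hits / n

closed = p_promising_closed(2.0, 1.0)
mc = p_promising_mc(2.0, 1.0, random.Random(3))
```

The two estimates agree to within Monte Carlo error, and both stay strictly below 1/2, as the sign structure of (28) requires.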

In order to obtain a positive ACR, the generator should be positive-adaptive, that is, ∃ρ ∈ (0, 1), ∃C > 0 such that ∀→x ∉ X∗,

 P_g(→x; S(→x, ρ)) ≥ C.

In order to ensure a positive lower bound on P_g(→x; S(→x, 1)), we choose an adaptive σ. Denote g(θ) = exp(2r^2 sin^2 θ/σ^2). From (28), we get

 P_g(→x; S(→x, 1)) = 1/2 − (1/π) exp(−2r^2/σ^2) ∫_{0}^{π/2} g(θ) dθ
 = 1/2 − (1/π) exp(−2r^2/σ^2) (∫_{0}^{π/4} + ∫_{π/4}^{π/2}) g(θ) dθ
 > 1/2 − (1/π) exp(−2r^2/σ^2) {g(π/4) + g(π/2)} (π/4)
 = (1/4){1 − exp(−r^2/σ^2)}.

If r/σ is bounded below by a positive constant C_0, then

 P_g(→x; S(→x, 1)) > (1/4){1 − exp(−C_0^2)}.

Take P_g(→x; S(→x, ρ)) as a function of ρ defined on the interval (0, 1]. Obviously it is continuous in ρ. That is, ∀ε > 0, ∃ρ_0 ∈ (0, 1) such that

 P_g(→x; S(→x, ρ)) > P_g(→x; S(→x, 1)) − ε

for all ρ in [ρ_0, 1]. Setting ε smaller than (1/4){1 − exp(−C_0^2)} and C = (1/4){1 − exp(−C_0^2)} − ε, we know that the generator is positive-adaptive with the contraction factor ρ_0 and the constant C.

For any adaptive σ such that r/σ is bounded below by a positive constant, according to Theorem 2, the limit of R_t is positive. A simple implementation is to let σ be proportional to r, which is the setting in Section III-B.

This case study shows the applicability of our theory to uni-modal functions and confirms the importance of using an adaptive σ even for the sphere function. Moreover, practical EAs such as evolutionary programming and evolution strategies always adopt an adaptive σ for a faster convergence speed.
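A small Monte Carlo experiment makes the contrast concrete: with σ proportional to r, the success probability P_g(→x; S(→x, 1)) is scale-invariant and stays bounded away from 0, while with a constant σ it collapses as the parent approaches the optimum. The specific radii and the sample size below are arbitrary illustrative choices:

```python
import random

def success_prob(r, sigma, rng, n=100000):
    """Monte Carlo estimate of P_g(x; S(x,1)) on the 2-D sphere function,
    with the parent placed at (r, 0) by rotational symmetry."""
    hits = 0
    for _ in range(n):
        y1 = r + rng.gauss(0.0, sigma)
        y2 = rng.gauss(0.0, sigma)
        hits += (y1 * y1 + y2 * y2 < r * r)
    return hits / n

rng = random.Random(4)
# adaptive sigma = r: roughly the same probability at every distance
p_far, p_near = success_prob(4.0, 4.0, rng), success_prob(0.01, 0.01, rng)
# invariant sigma = 1: the probability vanishes as the parent nears 0
q_far, q_near = success_prob(4.0, 1.0, rng), success_prob(0.01, 1.0, rng)
```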

### V-B 2-D Rastrigin Function

Consider minimization of the 2-D Rastrigin function:

 min f_R(→x) = 20 + ∑_{k=1}^{2} (x_k^2 − 10 cos 2πx_k), (29)

where →x = (x_1, x_2) ∈ R^2. The optimal solution is →x∗ = (0, 0) with f∗ = 0. The 2-D function is a sum of two 1-D Rastrigin functions:

 f_R(→x) = f_{R1}(x_1) + f_{R1}(x_2), (30)

where f_{R1}(x) = x^2 − 10 cos 2πx + 10.

The (1+1) elitist EA (Algorithm 2) is used to solve this minimization problem. Assume that →x is the parent at the t-th generation at the fitness level f(→x). Since the selection is elitist, the parent is replaced by a child →y only if →y falls in the promising region S(→x, 1).

Fig. (a) shows the fitness landscape of the 2-D Rastrigin function. Fig. (b) illustrates the projection of the landscape at a fitness level onto the decision plane.