In the theoretical study of EAs, a fundamental question is how fast an EA can find an optimal solution to a problem. In discrete optimization, this is measured by the number of generations (hitting time) or the number of fitness evaluations (running time) before an EA finds an optimal solution [1, 2]. However, computation time is seldom applied to continuous optimization. Unlike in discrete optimization, computation time is normally infinite in continuous optimization, because the optimal solution set of a continuous optimization problem is usually a zero-measure set. In order to apply computation time to continuous optimization, the optimal solution must be replaced by an $\epsilon$-neighbour of the optimal solution set [3, 4, 5], which forms a positive-measure set.
In continuous optimization, the performance of EAs is often evaluated by the convergence rate. Informally, the convergence rate question is how fast $d(X_t, X^*)$ converges to $0$, where $d(X_t, X^*)$ is a distance between the $t$-th generation population $X_t$ and the optimal solution(s) $X^*$. A lot of theoretical work has discussed this topic from different perspectives [6, 7, 8, 9, 10, 11]; however, convergence metrics studied in theory are seldom adopted in practice. This motivates us to design a practical convergence metric satisfying two requirements: feasible in calculation and rigorous in theory.
Our work emphasizes the convergence rate in terms of the approximation error, which evaluates the solution quality of EAs. Let $f(X_t)$ denote the fitness of the best individual in population $X_t$, $f_t = E[f(X_t)]$ its expected value, and $f^*$ the fitness of the optimal solution. The approximation error $e_t$ is $e_t = f_t - f^*$. In the context of $e_t$, the convergence rate question is how fast $e_t$ converges to $0$. It is straightforward to derive geometric convergence, $e_t \le c^t e_0$, from the condition $e_t \le c\, e_{t-1}$ for some $c \in (0, 1)$.
An alternative convergence metric is the error ratio between two generations (or the one-generation convergence rate): $e_t / e_{t-1}$. This ratio works well in deterministic iterative algorithms. Unfortunately, it is not appropriate for EAs because the calculation of $e_t / e_{t-1}$ is numerically unstable.
A remedy to the deficiency of the two-generation error ratio is to consider its average over $t$ consecutive generations. The geometric average convergence rate (ACR) proposed by He and Lin is
\[
R_t = 1 - \left( \frac{e_t}{e_0} \right)^{1/t}.
\]
From the ACR, it is straightforward to draw an exact expression of the approximation error: $e_t = (1 - R_t)^t\, e_0$. More importantly, the calculation of $R_t$ is more stable than that of $e_t / e_{t-1}$ in computer simulation.
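The identity $e_t = (1 - R_t)^t\, e_0$ can be verified mechanically; below is a minimal Python sketch on a synthetic error sequence (the numbers are illustrative, not experimental data):

```python
import math

def acr(errors, t):
    """Geometric average convergence rate: R_t = 1 - (e_t / e_0)**(1/t)."""
    return 1.0 - (errors[t] / errors[0]) ** (1.0 / t)

# A synthetic, monotonically decreasing error sequence, as an elitist EA would produce.
e = [1.0, 0.5, 0.3, 0.12, 0.05, 0.02]

for t in range(1, len(e)):
    R_t = acr(e, t)
    # Exact expression of the approximation error: e_t = (1 - R_t)**t * e_0.
    assert math.isclose(e[t], (1.0 - R_t) ** t * e[0])
```

The identity holds by construction for any positive error sequence, which is what makes $R_t$ a faithful summary of the error reduction over $t$ generations.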
For discrete optimization, it has been proven that, under random initialization, $R_t$ converges to a positive value, and under particular initialization, $R_t$ always equals this positive value.
The current paper extends the analysis of the ACR from discrete optimization to continuous optimization. However, the extension is not trivial, due to the completely different probability measures in discrete and continuous spaces. There are two essential changes in the extension.
The analyses are different. In continuous optimization, an EA is modelled by a Markov chain in a continuous state space, rather than a Markov chain in a finite state space. Thus the matrix analysis used in the discrete case cannot be applied to continuous optimization.
The results are different. For continuous optimization, Theorems 1 and 2 in this paper claim that, given a convergent EA modelled by a homogeneous Markov chain, its ACR converges to 0 if its generator is landscape-invariant, or converges to a positive value if its generator is positive-adaptive. But for discrete optimization, Theorem 1 in the previous work states that for all convergent EAs modelled by homogeneous Markov chains, the ACR converges to a positive value.
II. Related Work
The convergence rate of EAs has been investigated from different perspectives and in varied terms.
Rudolph proved that, under the condition $E[e_{t+1} \mid e_t] \le c\, e_t$ for some $c \in (0, 1)$, the sequence $\{e_t\}$ converges in mean geometrically fast to $0$, that is, $e_t = O(c^t)$. For a superset of the class of quadratic functions, sharp bounds on the convergence rate were obtained.
Rudolph compared Gaussian and Cauchy mutation on minimizing the sphere function in terms of the rate of local convergence, $E\left[\|X_{t+1} - x^*\| / \|X_t - x^*\|\right]$, where $\|\cdot\|$ denotes the Euclidean norm. He proved the rate is identical for Gaussian and spherical Cauchy distributions, whereas nonspherical Cauchy mutations lead to slower local convergence.
Beyer developed a systematic theory of evolution strategies (ES) based on the progress rate and the quality gain. The progress rate measures the change of the distance to the optimal solution in one generation, $\varphi_t = E[\,\|X_t - x^*\| - \|X_{t+1} - x^*\|\,]$. The quality gain is the fitness change in one generation, $\bar{q}_t = E[\bar{f}_t - \bar{f}_{t+1}]$, where $\bar{f}_t$ is the fitness mean of individuals in population $X_t$. Recently, Beyer et al. [15, 16] analyzed the dynamics of ES with cumulative step-size adaptation and of ES with self-adaptation and multi-recombination on the ellipsoid model and derived the quadratic progress rate. Akimoto et al. investigated evolution strategies with weighted recombination on general convex quadratic functions and derived the asymptotic quality gain. However, Auger and Hansen argued the limits of predictions based on the progress rate.
Auger and Hansen developed the theory of ES from a new perspective using the stability of Markov chains. Auger investigated the $(1,\lambda)$-SA-EA on the sphere function and proved the convergence of $\frac{1}{t} \ln \|X_t\|$ based on Foster-Lyapunov drift conditions. Jebalia et al. investigated the convergence rate of the scale-invariant (1+1)-ES in minimizing the noisy sphere function and proved a log-linear convergence rate in the sense that $\frac{1}{t} \ln \|X_t\| \to -c$ for some $c > 0$ as $t \to \infty$. Auger and Hansen further investigated comparison-based step-size adaptive randomized search on scaling-invariant objective functions and proved that $\frac{1}{t} \ln \frac{\|X_t\|}{\|X_0\|} \to -c$ as $t \to \infty$, for some $c > 0$. This log-linear convergence is an extension of the average rate of convergence in deterministic iterative methods.
In a related line of work, the behaviour of an EA is measured by the difference between $\mu_t$, the probability distribution of $X_t$, and a stationary probability distribution $\pi$. Based on the Doeblin condition, geometrically decaying bounds on $\|\mu_t - \pi\|$, of order $O((1-\delta)^t)$ for some $\delta \in (0, 1)$, were obtained. He and Yu also derived lower and upper bounds on $P(X_t \in S_\epsilon)$, where $S_\epsilon$ denotes an $\epsilon$-neighbour of the optimal set and $P(X_t \in S_\epsilon)$ the probability of entering it.
This paper develops Rudolph's early work, which showed the geometric convergence of $e_t$ but did not provide a method to quantify the convergence rate. We take $R_t$ as a practical metric to measure geometric convergence and make a rigorous analysis.
III. Definitions and Practical Usage
A continuous minimization problem is to
\[
\min f(x), \qquad x \in D,
\]
where $f(x)$ is a continuous function defined on a closed set $D$. Denote the optimal solution set by $D^*$ and the optimal fitness by $f^* = \min\{f(x) : x \in D\}$. We assume the optimal solution set of the above problem is a finite set.
A solution is a vector in $D$ and a population is a vector in $D^N$, where $N$ is the population size. A general framework of elitist EAs for solving optimization problems is described in Algorithm 1. Two types of genetic operators are employed in the algorithm. One is the generation operator, which generates new individuals from a population, such as mutation or crossover. The other is the selection operator, which selects individuals from a population. Any non-elitist EA can be modified into an equivalent elitist EA by adding an archive individual which preserves the best solution found so far but does not take part in evolution. Hereafter we only consider elitist EAs.
The fitness of a population $X$ is $f(X) = \min\{f(x) : x \in X\}$ and the approximation error of $X_t$ is $e_t = E[f(X_t)] - f^*$. The sequence $\{X_t\}$ is called convergent in mean if $\lim_{t\to\infty} e_t = 0$, and convergent almost surely if $P(\lim_{t\to\infty} f(X_t) = f^*) = 1$.
Lemma 1: For elitist EAs, if the sequence $\{X_t\}$ converges in mean, then it converges almost surely.
The ACR evaluates the average convergence speed of EAs over $t$ consecutive generations. The following definition is applicable to both elitist and non-elitist EAs.
Let $f_t = E[f(X_t)]$ and $e_t = f_t - f^*$. The geometric average convergence rate (ACR) of an EA for $t$ generations is
\[
R_t = 1 - \left( \frac{e_t}{e_0} \right)^{1/t}. \qquad (3)
\]
If $e_{t_0} = 0$ for some $t_0$, let $R_t = 1$ for any $t \ge t_0$.
In (3), the term $(e_t/e_0)^{1/t}$ represents a geometric average of the reduction factor $e_t/e_{t-1}$ over $t$ generations, and subtracting it from $1$ normalizes the average into the interval $[0, 1]$ for elitist EAs. The ACR can be regarded as the speed of convergence, while the error $e_t$ is the distance from the optimal set. If $R_t > 0$, then the speed is positive and $e_t < e_0$; if $R_t = 0$, then the speed is zero and $e_t = e_0$; if $R_t < 0$ (which never happens in elitist EAs), then the speed is negative and $e_t > e_0$. Like the speed of light, the speed of convergence has an upper limit, that is, $R_t \le 1$.
III-B Practical Usage of Average Convergence Rate
The ACR provides a simple method to numerically measure how fast an EA converges. This is the main purpose of the ACR. In practice, the expected value $f_t$ is replaced by the sample mean of $f(X_t)$ over $K$ runs of the EA. The ACR is calculated in four steps:
1) run the EA for $K$ times;
2) calculate the fitness sample mean:
\[
\bar{f}_t = \frac{1}{K} \sum_{k=1}^{K} f(X_t^{[k]}),
\]
where $f(X_t^{[k]})$ denotes the best fitness at the $k$-th run;
3) calculate the approximation error: $\tilde{e}_t = \bar{f}_t - f^*$;
4) finally, calculate the ACR:
\[
\tilde{R}_t = 1 - \left( \frac{\tilde{e}_t}{\tilde{e}_0} \right)^{1/t}.
\]
According to the Law of Large Numbers, it holds that $\tilde{e}_t \to e_t$ and $\tilde{R}_t \to R_t$ as $K \to \infty$.
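The four steps above can be sketched in a few lines of Python; the function below takes the recorded best-fitness histories of $K$ runs (the function name and the toy input are illustrative, not from the paper):

```python
def acr_estimate(fitness_runs, f_star):
    """Estimate the ACR from K independent runs.

    fitness_runs[k][t] is the best fitness found by run k at generation t.
    Steps: the runs are given (step 1); average the fitness over runs (step 2);
    subtract f* to get the approximation error (step 3); apply the ACR formula (step 4).
    """
    K = len(fitness_runs)
    T = len(fitness_runs[0])
    mean_f = [sum(run[t] for run in fitness_runs) / K for t in range(T)]   # step 2
    e = [mf - f_star for mf in mean_f]                                     # step 3
    return [1.0 - (e[t] / e[0]) ** (1.0 / t) for t in range(1, T)]         # step 4

# Illustrative input: two runs, three recorded generations, f* = 0.
print(acr_estimate([[4.0, 2.0, 1.0], [2.0, 1.0, 0.5]], 0.0))  # → [0.5, 0.5]
```

The mean errors here are $3, 1.5, 0.75$, so the error halves in every generation and the estimated ACR is constant at $0.5$.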
An example is given to show the usage of the ACR in computer simulation. The aim is to compare two EAs on two benchmark functions in terms of the ACR. The benchmarks are the 2-dimensional sphere and Rastrigin functions:
\[
f_1(x) = x_1^2 + x_2^2, \qquad f_2(x) = \sum_{i=1}^{2} \left( x_i^2 - 10 \cos(2\pi x_i) + 10 \right).
\]
The minimal point of both functions is $x^* = (0, 0)$ with $f(x^*) = 0$. The two EAs are variants of the (1+1) elitist EA (Algorithm 2) which adopt Gaussian mutation:
\[
x' = x_t + z_t,
\]
where $x_t$ is the parent, $x'$ the child, and $z_t$ a Gaussian random vector obeying the probability distribution $\mathcal{N}(0, \sigma_t^2 I)$. There are two ways to set the variance $\sigma_t^2$.
Invariant-$\sigma$: $\sigma_t$ is set to a constant for all $t$. In the computer simulation, a fixed constant is used.
Adaptive-$\sigma$: $\sigma_t$ takes varied values at different $t$. In the computer simulation, $\sigma_t$ is adapted during the run.
For the sake of brevity, the EA using invariant-$\sigma$ mutation is called the invariant EA, and the EA using adaptive-$\sigma$ mutation the adaptive EA.
In the experiment, each EA starts from the same initial solution $x_0$ and is run repeatedly up to a fixed maximum number of generations.
The ACR quantifies the speed of convergence. Table I shows that the ACR value of the adaptive EA is much larger than that of the invariant EA on both the sphere and Rastrigin functions.
Fig. 1 illustrates the trend of $R_t$. The ACR of the adaptive EA tends to stabilize at some positive value, while the ACR of the invariant EA keeps decreasing. This phenomenon will be analyzed rigorously later.
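Because the experimental settings above are only summarized, the following Python sketch reproduces the qualitative phenomenon under assumed stand-in settings: a (1+1) EA on the 2-D sphere function with a constant $\sigma$ versus a hypothetical adaptive rule $\sigma_t = \sqrt{f(x_t)}$ (i.e., $\sigma$ proportional to the current distance scale); the initial point, run count, and generation budget are assumptions, not the paper's values:

```python
import math, random

def one_plus_one_ea(f, x0, sigma_of, generations, rng):
    """(1+1) elitist EA with Gaussian mutation x' = x + N(0, sigma_t^2 I)."""
    x, fx = list(x0), f(x0)
    history = [fx]
    for _ in range(generations):
        s = sigma_of(x, fx)
        child = [xi + rng.gauss(0.0, s) for xi in x]
        fc = f(child)
        if fc < fx:          # elitist selection: keep the better solution
            x, fx = child, fc
        history.append(fx)
    return history

def acr(errors, t):
    return 1.0 - (errors[t] / errors[0]) ** (1.0 / t)

sphere = lambda x: sum(xi * xi for xi in x)

rng = random.Random(0)
K, T = 30, 200
runs_inv, runs_ada = [], []
for _ in range(K):
    x0 = [2.0, 2.0]
    runs_inv.append(one_plus_one_ea(sphere, x0, lambda x, fx: 1.0, T, rng))
    # Hypothetical adaptive rule: sigma proportional to the current distance scale.
    runs_ada.append(one_plus_one_ea(sphere, x0, lambda x, fx: math.sqrt(fx), T, rng))

def mean_error(runs, t):  # f* = 0 for the sphere function
    return sum(r[t] for r in runs) / len(runs)

e_inv = [mean_error(runs_inv, t) for t in range(T + 1)]
e_ada = [mean_error(runs_ada, t) for t in range(T + 1)]
print("invariant ACR:", acr(e_inv, T))
print("adaptive  ACR:", acr(e_ada, T))
```

Under these assumed settings the adaptive EA's ACR comes out clearly larger than the invariant EA's, matching the qualitative picture described for Table I and Fig. 1.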
III-C Discussion of Other Convergence Metrics
A good convergence metric should satisfy two requirements: being feasible in calculation and rigorous in analysis. We discuss two common convergence metrics and show that they do not satisfy these requirements.
The ratio $e_t / e_{t-1}$ is a popular convergence metric in deterministic iterative algorithms, which quantifies the reduction of the error in one iteration. Fig. 2 illustrates the value of $e_t / e_{t-1}$ for the adaptive EA. The ratio fluctuates greatly. The calculation of $e_t / e_{t-1}$ is sensitive and unstable due to the randomness of $e_t$. Therefore, it is not a practical metric to measure the convergence rate of EAs.
The logarithmic scale, $\ln e_t$, is probably the most widely used convergence metric for comparing the convergence speed of EAs in practice. Fig. 3 displays the value of $\ln e_t$ for the adaptive and invariant EAs. When using $\ln e_t$ to compare the speed of convergence of two EAs, it is necessary to visualize $\ln e_t$ in a figure and compare the slopes of $\ln e_t$ via observation. Fig. 3 shows that the slope of $\ln e_t$ of the adaptive EA is steeper than that of the invariant EA. However, an observation is an observation, not an analysis. The slope of $\ln e_t$ might be taken as a convergence metric, but like $e_t / e_{t-1}$, its calculation is sensitive and unstable in computer simulation.
Summarizing the above discussion, we conclude that neither $e_t / e_{t-1}$ nor $\ln e_t$ is appropriate as a convergence metric.
IV. General Analyses
IV-A Transition Probabilities
An EA is determined by its operators: generator and selection. Mathematically, both can be represented by transition probabilities.
Let $\mathcal{P}$ denote the set consisting of all populations. A population is represented by a capital letter such as $X$. The $t$-th generation population is represented by $X_t$, which is a random vector. A population $X$ satisfying $f(X) = f^*$ is called an optimal population, and the collection of all optimal populations is denoted by $\mathcal{P}^*$.
Given a contraction factor $\epsilon \in (0, 1]$ and a population $X$, the set $\mathcal{P}$ can be divided into two disjoint subsets:
\[
\mathcal{P}_\epsilon(X) = \{Y : f(Y) - f^* \le \epsilon\,(f(X) - f^*)\}, \qquad \bar{\mathcal{P}}_\epsilon(X) = \mathcal{P} \setminus \mathcal{P}_\epsilon(X).
\]
The set $\mathcal{P}_\epsilon(X)$ is called an $\epsilon$-promising region, and especially when $\epsilon = 1$, the set $\mathcal{P}_1(X)$ is called a promising region.
The generation of new individuals via the generation operator is denoted as $X \to Y$. It can be characterized by a probability transition. Given a population $X$ and a population set $A$, the transition probability kernel is defined as
\[
P_{gen}(X, A) = P(Y \in A \mid X).
\]
Similarly, the selection operation can be described by a probability transition too. Given any population $X$ and a population set $A$, its transition probability kernel is defined as
\[
P_{sel}(X, A) = \int_A p_{sel}(X, Y)\, \mathrm{d}Y,
\]
where $p_{sel}(X, Y)$ is a transition probability density function.
A one-generation update of the population, $X_t \to X_{t+1}$, is described by a probability transition. Given any population $X$ and a population set $A$, its transition probability kernel is defined as
\[
P(X, A) = \int_A p(X, Y)\, \mathrm{d}Y, \qquad (9)
\]
where $p(X, Y)$ is a transition probability density function.
Generally, the operators for generating new individuals can be classified into two categories.
Let $p_{gen}(X, Y)$ be the probability density function depicting the generation transition from $X$ to $Y$.
Landscape-invariant: a generator is called landscape-invariant if $Y = X + Z$ and $Z$ is a multivariate random variable whose joint probability distribution is independent of $X$. Here $X + Z$ represents the population $(x^{(1)} + z^{(1)}, \dots, x^{(N)} + z^{(N)})$.
We assume the density function of $Z$, denoted $p_Z$, is continuous and bounded; Cauchy and Gaussian distributions are such examples.
Landscape-adaptive: otherwise, a generator is called landscape-adaptive.
A landscape-invariant generator generates candidate solutions subject to the same probability distribution no matter where the parent population is located. An example is the invariant-$\sigma$ Gaussian mutation described in Algorithm 2. A landscape-adaptive generator adjusts the probability distribution according to the position of the parent population. An example is the adaptive-$\sigma$ Gaussian mutation in Algorithm 2.
For the landscape-invariant generator, the lemma below states that the infimum of the transition probability to the promising region equals zero.
Lemma 2: If the number of optimal solutions is finite and the generator is landscape-invariant, then the transition probability to the promising region satisfies
\[
\inf\{P_{gen}(X, \mathcal{P}_1(X)) : X \notin \mathcal{P}^*\} = 0, \qquad (11)
\]
where $\inf$ is the abbreviation of mathematical infimum.
In order to prove (11), it is sufficient to prove that for any $\delta > 0$ there exists a population $X \notin \mathcal{P}^*$ such that
\[
P_{gen}(X, \mathcal{P}_1(X)) < \delta. \qquad (12)
\]
For a Lebesgue-measurable set $A$, let $m(A)$ denote its Lebesgue measure. Because the mutation density $p_Z$ is a continuous and bounded function, the probability of the child $X + Z$ falling in a small area is small (where $X$ is fixed but $Z$ is random). More strictly, for any $\delta > 0$ there exists $\epsilon > 0$ such that for any measurable set $A$ with $m(A) < \epsilon$, it holds that
\[
P(X + Z \in A) \le \sup_z p_Z(z)\, m(A) < \delta. \qquad (14)
\]
Because the number of optimal solutions is finite (so that the optimal set $D^*$ has Lebesgue measure $m(D^*) = 0$) and $f$ is continuous, for the neighbourhood $D^*_\eta = \{x : \min_{x^* \in D^*} \|x - x^*\| < \eta\}$ we may choose $\eta$ sufficiently small so that $m(D^*_\eta) < \epsilon$. Because $f$ is continuous, we may choose $X \notin \mathcal{P}^*$ with $f(X)$ sufficiently close to $f^*$ so that every solution at least as good as $X$ lies in $D^*_\eta$. This implies the promising region $\mathcal{P}_1(X) \subset D^*_\eta$.
According to (14) and $\mathcal{P}_1(X) \subset D^*_\eta$ with $m(D^*_\eta) < \epsilon$, we have
\[
P_{gen}(X, \mathcal{P}_1(X)) \le P(X + Z \in D^*_\eta) < \delta.
\]
Because $\delta > 0$ is arbitrary, we have
\[
\inf\{P_{gen}(X, \mathcal{P}_1(X)) : X \notin \mathcal{P}^*\} = 0.
\]
The above inequality is the wanted result (11).
IV-B Analysis of Landscape-invariant Generators
For elitist EAs using landscape-invariant generators, Theorem 1 below indicates that the limit of the ACR is 0.
In order to prove $\lim_{t\to\infty} R_t = 0$, it is sufficient to prove that $\lim_{t\to\infty} (e_t/e_0)^{1/t} = 1$. According to the definition of the limit, it is sufficient to prove that for any $\delta \in (0, 1)$ there exists $T$ such that for all $t \ge T$,
\[
\left( \frac{e_t}{e_0} \right)^{1/t} \ge 1 - \delta.
\]
From Lemma 1, the sequence $f(X_t)$ converges almost surely to $f^*$. Denote by $A_t$ the event that $X_t$ lies so close to the optimal set that, by the argument of Lemma 2, the transition probability to the promising region satisfies $P_{gen}(X_t, \mathcal{P}_1(X_t)) < \delta$.
For the set $A_t$, the error is reduced in one generation with probability less than $\delta$, so the conditional expected error satisfies $E[f(X_{t+1}) - f^* \mid A_t] \ge (1 - \delta)\, E[f(X_t) - f^* \mid A_t]$; and for the complementary set $\bar{A}_t$, we know that for the given $\delta$, the almost sure convergence implies $P(\bar{A}_t) \to 0$ as $t \to \infty$, so the contribution of $\bar{A}_t$ to $e_t$ vanishes.
Combining the two cases, we obtain that there exists $T$ such that for all $t \ge T$,
\[
\frac{e_{t+1}}{e_t} \ge 1 - \delta, \quad \text{and thus} \quad e_t \ge e_T\, (1 - \delta)^{t - T}.
\]
While $e_T > 0$, we know there exists a positive constant $e_T / e_0$ such that $(e_t/e_0)^{1/t} \ge (e_T/e_0)^{1/t} (1 - \delta)^{(t-T)/t} \to 1 - \delta$. Since $\delta$ is arbitrary, $\lim_{t\to\infty} (e_t/e_0)^{1/t} = 1$ and therefore $\lim_{t\to\infty} R_t = 0$.
Theorem 1 states that for EAs using landscape-invariant generators, the limit of their ACR is 0 as $t \to \infty$. This implies that landscape-invariant generators are not appropriate for solving continuous optimization problems.
Theorem 1 may not hold if the Lebesgue measure of the optimal solution set is positive. However, for most continuous optimization problems, the optimal solution set is a zero-measure set.
IV-C Analysis of Landscape-adaptive Generators
Landscape-adaptive generators can be split into two types:
positive-adaptive: a landscape-adaptive generator is called positive-adaptive if there exist $\epsilon \in (0, 1)$ and $c > 0$ such that the transition probability to the $\epsilon$-promising region satisfies
\[
P_{gen}(X, \mathcal{P}_\epsilon(X)) \ge c \quad \text{for all } X \notin \mathcal{P}^*; \qquad (22)
\]
zero-adaptive: a landscape-adaptive generator is called zero-adaptive if the transition probability to the promising region satisfies
\[
\inf\{P_{gen}(X, \mathcal{P}_1(X)) : X \notin \mathcal{P}^*\} = 0. \qquad (23)
\]
A zero-adaptive generator represents bad adaptation because it leads to a zero-valued ACR. (23) includes two cases:
1) $P_{gen}(X, \mathcal{P}_1(X)) > 0$ for all $X \notin \mathcal{P}^*$, but the infimum is 0. The analysis of this case is similar to Theorem 1; then $\lim_{t\to\infty} R_t = 0$.
2) There exists $X \notin \mathcal{P}^*$ such that $P_{gen}(X, \mathcal{P}_1(X)) = 0$. When an EA starts from this $X$, $e_t = e_0 > 0$ for all $t$, and then $R_t = 0$.
However, a positive-adaptive generator always represents good adaptation because it ensures that the limit of the ACR is positive.
From (9), we know that for any $X \notin \mathcal{P}^*$,
\[
P(X, \mathcal{P}_\epsilon(X)) \ge P_{gen}(X, \mathcal{P}_\epsilon(X)) \ge c,
\]
because elitist selection always accepts a child generated in the $\epsilon$-promising region.
It follows that, for any $X \notin \mathcal{P}^*$ and any $t$,
\[
E[f(X_{t+1}) - f^* \mid X_t = X] \le c\,\epsilon\,(f(X) - f^*) + (1 - c)(f(X) - f^*) = \big(1 - c(1 - \epsilon)\big)(f(X) - f^*).
\]
So we get
\[
e_{t+1} \le \big(1 - c(1 - \epsilon)\big)\, e_t, \quad \text{and hence} \quad e_t \le \big(1 - c(1 - \epsilon)\big)^t\, e_0.
\]
Let $c^* = c(1 - \epsilon) > 0$. It holds that
\[
R_t = 1 - \left( \frac{e_t}{e_0} \right)^{1/t} \ge 1 - (1 - c^*) = c^* > 0.
\]
Theorem 2 indicates that if an EA employs a positive-adaptive generator, then it converges to the optimal set with a positive ACR. How to design a generator satisfying the positive-adaptive condition (22) is an important question. An example is Rechenberg's 1/5th success rule for controlling the mutation strength in evolution strategies. From a theoretical viewpoint, Theorems 1 and 2 together confirm the necessity of using adaptive generators in continuous optimization.
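As a concrete illustration of such an adaptive design, below is a minimal Python sketch of a (1+1) EA using Rechenberg's 1/5th success rule; the window length and the adaptation factor 0.82 are conventional choices, not values taken from this paper:

```python
import random

def one_fifth_rule_ea(f, x0, sigma0, generations, window=10, factor=0.82, rng=None):
    """(1+1) elitist EA adapting sigma by the 1/5th success rule:
    expand sigma when more than 1/5 of recent mutations succeed, shrink otherwise."""
    rng = rng or random.Random(0)
    x, fx, sigma = list(x0), f(x0), sigma0
    successes = 0
    for t in range(1, generations + 1):
        child = [xi + rng.gauss(0.0, sigma) for xi in x]
        fc = f(child)
        if fc < fx:          # elitist selection
            x, fx = child, fc
            successes += 1
        if t % window == 0:  # adapt sigma every `window` generations
            if successes / window > 0.2:
                sigma /= factor
            elif successes / window < 0.2:
                sigma *= factor
            successes = 0
    return x, fx, sigma

sphere = lambda x: sum(xi * xi for xi in x)
x, fx, sigma = one_fifth_rule_ea(sphere, [2.0, 2.0], 1.0, 2000)
```

By shrinking $\sigma$ near the optimum, the rule keeps the success probability, and hence the probability of reaching an $\epsilon$-promising region, bounded away from zero, which is the intuition behind the positive-adaptive condition.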
IV-D Analysis of Elitist EAs Not Convergent in Mean to 0
The analysis of this kind of EA is rather simple. The theorem below states that the limit of the ACR is 0.
If the sequence $\{e_t\}$ does not converge to 0, then
\[
\lim_{t \to \infty} R_t = 0.
\]
Due to elitist selection, the sequence $\{e_t\}$ is monotonically decreasing in $t$. According to the monotone convergence theorem, the limit $\lim_{t\to\infty} e_t$ exists. The condition says it is not 0, thus $\lim_{t\to\infty} e_t = e_\infty > 0$. Then
\[
\lim_{t\to\infty} R_t = 1 - \lim_{t\to\infty} \left( \frac{e_t}{e_0} \right)^{1/t} = 1 - \lim_{t\to\infty} \left( \frac{e_\infty}{e_0} \right)^{1/t} = 0.
\]
V. Case Studies
V-A 2-D Sphere Function
Consider minimization of the 2-dimensional (2-D) sphere function:
\[
\min f(x) = x_1^2 + x_2^2. \qquad (26)
\]
The optimal solution is $x^* = (0, 0)$ with $f(x^*) = 0$.
Since the mutation obeys the Gaussian probability distribution (8), its probability density function is
\[
p_Z(z) = \frac{1}{2\pi\sigma_t^2} \exp\left( -\frac{z_1^2 + z_2^2}{2\sigma_t^2} \right).
\]
Recalling that the sphere function is symmetric about the origin of coordinates, without loss of generality we set the parent $x_t = (r_t, 0)$, where $r_t = \|x_t\|$.
Since the selection is elitist, the parent $x_t$ can be replaced by a child $x'$ only if $x'$ falls in the promising region $\mathcal{P}_1(x_t)$. For problem (26), the promising region is the circle centred at $x^*$ with radius $r_t = \|x_t\|$. So,
\[
P_{gen}(x_t, \mathcal{P}_1(x_t)) = \int_{\|y\| < r_t} \frac{1}{2\pi\sigma_t^2} \exp\left( -\frac{\|y - x_t\|^2}{2\sigma_t^2} \right) \mathrm{d}y. \qquad (28)
\]
If $\sigma_t$ is a constant, then the mutation is landscape-invariant. When the (1+1) EA converges to the optimal solution, the radius $r_t$ converges to 0. As a result, the value of (28) also converges to 0 since $\sigma_t$ is a constant. This means that (12) in Lemma 2 is true. According to Theorem 1, $R_t$ converges to 0 as $t \to \infty$.
In order to obtain a positive ACR, the generator should be positive-adaptive, that is, there exist $\epsilon \in (0, 1)$ and $c > 0$ such that for all $x_t \ne x^*$,
\[
P_{gen}(x_t, \mathcal{P}_\epsilon(x_t)) \ge c.
\]
In order to ensure a positive lower bound on (28), we choose an adaptive $\sigma_t$ proportional to the distance to the optimum, $\sigma_t = s\, r_t$ with $s > 0$. Denote the normalized variable $u = y / r_t$; by the symmetry noted above, $x_t / r_t = (1, 0)$. From (28), we get
\[
P_{gen}(x_t, \mathcal{P}_\epsilon(x_t)) = \int_{\|u\| \le \sqrt{\epsilon}} \frac{1}{2\pi s^2} \exp\left( -\frac{\|u - (1, 0)\|^2}{2 s^2} \right) \mathrm{d}u =: g(s),
\]
which no longer depends on $r_t$.
If $s$ is bounded below and above by positive constants $s_1$ and $s_2$, then $g(s)$ is bounded below by a positive constant. Take $g$ as a function of $s$ defined on the interval $[s_1, s_2]$. Obviously $g$ is continuous and positive. That is, there exists a constant $c > 0$ such that
\[
P_{gen}(x_t, \mathcal{P}_\epsilon(x_t)) = g(s) \ge c
\]
for all $s$ in $[s_1, s_2]$. Setting $\sigma_t = s\, r_t$ with $s \in [s_1, s_2]$, we know that the generator is positive-adaptive with the contraction factor $\epsilon$ for all $x_t \ne x^*$.
This case study shows the applicability of our theory to unimodal functions and confirms the importance of using an adaptive $\sigma_t$ even for the sphere function. Moreover, practical EAs such as evolutionary programming and evolution strategies always adopt an adaptive $\sigma_t$ for a faster convergence speed.
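The contrast between a constant and a distance-proportional $\sigma$ can also be checked numerically; the Monte-Carlo sketch below (an illustration under assumed parameters, with the parent placed at $(r, 0)$ and the factor $s = 0.5$ chosen arbitrarily) estimates the probability that Gaussian mutation hits the promising region of the 2-D sphere:

```python
import random

def success_prob(r, sigma, trials=100_000):
    """Monte-Carlo estimate of P(||x + z|| < r) for a parent x = (r, 0) and a
    mutation z ~ N(0, sigma^2 I): the probability that Gaussian mutation lands
    in the promising region of the 2-D sphere function."""
    rng = random.Random(1)  # fixed seed so the estimates are reproducible
    hits = 0
    for _ in range(trials):
        y1 = r + rng.gauss(0.0, sigma)
        y2 = rng.gauss(0.0, sigma)
        if y1 * y1 + y2 * y2 < r * r:
            hits += 1
    return hits / trials

# Constant sigma: the probability of hitting the promising region vanishes as r -> 0.
p_const = [success_prob(r, sigma=1.0) for r in (1.0, 0.1, 0.01)]
# Sigma proportional to r (here s = 0.5, an assumed value): the probability is scale-free.
p_adapt = [success_prob(r, sigma=0.5 * r) for r in (1.0, 0.1, 0.01)]
print(p_const)   # decreasing towards 0
print(p_adapt)   # (roughly) the same value three times
```

The constant-$\sigma$ estimates shrink with $r$, mirroring the argument via (28) and Theorem 1, while the proportional-$\sigma$ estimates are independent of $r$, mirroring the scale-free bound $g(s)$ above.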
V-B 2-D Rastrigin Function
Consider minimization of the 2-D Rastrigin function:
\[
\min f(x) = \sum_{i=1}^{2} \left( x_i^2 - 10 \cos(2\pi x_i) + 10 \right),
\]
where $x = (x_1, x_2)$. The optimal solution is $x^* = (0, 0)$ with $f(x^*) = 0$. The 2-D function is a sum of two 1-D Rastrigin functions:
\[
f(x) = g(x_1) + g(x_2), \qquad g(u) = u^2 - 10 \cos(2\pi u) + 10.
\]
The (1+1) elitist EA (Algorithm 2) is used to solve this minimization problem. Assume that $x_t$ is the parent at the $t$-th generation, at the fitness level $f(x_t)$. Since the selection is elitist, the parent is replaced by a child $x'$ only if $x'$ falls in the promising region $\mathcal{P}_1(x_t) = \{y : f(y) \le f(x_t)\}$.
Consider the partial derivative
\[
\frac{\partial f}{\partial x_i} = 2 x_i + 20\pi \sin(2\pi x_i), \quad i = 1, 2. \qquad (31)
\]
Because $\sin(2\pi x_i)$ is a periodic function with values restricted in $[-1, 1]$, all solutions to equation (31) are located in the interval $[-10\pi, 10\pi]$. So, the 2-D Rastrigin function has only finitely many global/local optimal solutions. For any $x_t$ outside the optimal set, the promising region is decomposed into finitely many mutually disjoint subsets (let $J$ denote the number of subsets):
\[
\mathcal{P}_1(x_t) = A_1 \cup A_2 \cup \cdots \cup A_J.
\]