I Introduction
In the theoretical study of EAs, a fundamental question is: how fast can an EA find an optimal solution to a problem? In discrete optimization, this can be measured by the number of generations (hitting time) or the number of fitness evaluations (running time) needed before an EA finds an optimal solution [1, 2]. However, computation time is seldom applied to continuous optimization. Unlike in discrete optimization, computation time is normally infinite in continuous optimization because the optimal solution set of a continuous optimization problem is usually a zero-measure set. In order to apply computation time to continuous optimization, the optimal solution set must be replaced by a neighbourhood of the optimal solution set [3, 4, 5], which forms a positive-measure set.
In continuous optimization, the performance of EAs is often evaluated by the convergence rate. Informally, the convergence rate question is: how fast does the distance between the $t$th generation population and the optimal solution(s) converge to 0? A lot of theoretical work has discussed this topic from different perspectives [6, 7, 8, 9, 10, 11]; however, the convergence metrics studied in theory are seldom adopted in practice. This motivates us to design a practical convergence metric satisfying two requirements: feasible in calculation and rigorous in theory.
Our work emphasizes the convergence rate in terms of the approximation error. The approximation error evaluates the solution quality of EAs. Let $f_t$ denote the fitness of the best individual in the population at generation $t$, $E[f_t]$ its expected value, and $f^*$ the fitness of the optimal solution. The approximation error [12] is $e_t = E[f_t] - f^*$. In the context of $e_t$, the convergence rate question is: how fast does $e_t$ converge to 0? It is straightforward to derive geometric convergence from a one-step contraction condition of the form $e_{t+1} \le c\, e_t$ with $c \in (0,1)$ [6].
An alternative convergence metric is the error ratio between two generations (or one-generation convergence rate): $e_{t+1}/e_t$. This ratio works well in deterministic iterative algorithms. Unfortunately, it is not appropriate for EAs because the calculation of $e_{t+1}/e_t$ is numerically unstable.
A remedy to the deficiency of the two-generation error ratio is to consider its geometric average over $t$ consecutive generations. The geometric average convergence rate (ACR) proposed by He and Lin [13] is

(1)  $R_t = 1 - \left(\frac{e_t}{e_0}\right)^{1/t}.$
From the ACR, it is straightforward to draw an exact expression of the approximation error: $e_t = e_0 (1 - R_t)^t$. More importantly, the calculation of $R_t$ is more stable than that of the one-generation ratio $e_{t+1}/e_t$ in computer simulation.
For discrete optimization, it has been proven [13] that under random initialization, $R_t$ converges to a positive limit, and under particular initialization, $R_t$ always equals this positive limit.
The current paper extends the analysis of the ACR from discrete optimization to continuous optimization. However, the extension is not trivial due to completely different probability measures in discrete and continuous spaces. There are two essential changes in the extension.
The analyses are different. In continuous optimization, an EA is modeled by a Markov chain in a continuous state space, rather than a Markov chain in a finite state space. Thus the matrix analysis used in [13] cannot be applied to continuous optimization.
The results are different. For continuous optimization, Theorems 1 and 2 in this paper claim that, given a convergent EA modelled by a homogeneous Markov chain, its ACR converges to 0 if its generator is landscape-invariant, or converges to a positive limit if its generator is positive-adaptive. But for discrete optimization, Theorem 1 in [13] states that for all convergent EAs modelled by homogeneous Markov chains, the ACR converges to a positive limit.
II Related Work
The convergence rate of EAs has been investigated from different perspectives and in varied terms.
Rudolph [6] proved that under a one-step contraction condition on the expected error, the sequence $\{e_t\}$ converges in mean geometrically fast to 0, that is, $e_t = O(c^t)$ for some $c \in (0, 1)$. For a superset of the class of quadratic functions, sharp bounds on the convergence rate are obtained.
Rudolph [7] compared Gaussian and Cauchy mutation for minimizing the sphere function in terms of the rate of local convergence, measured by the expected one-step reduction of the distance $\|x_t - x^*\|$, where $\|\cdot\|$ denotes the Euclidean norm. He proved that the rate is identical for Gaussian and spherical Cauchy distributions, whereas nonspherical Cauchy mutations lead to slower local convergence.
Beyer [14] developed a systematic theory of evolution strategies (ES) based on the progress rate and the quality gain. The progress rate measures the expected change of the distance to the optimal solution in one generation, while the quality gain is the expected change of the mean fitness of the population in one generation. Recently, Beyer et al. [15, 16] analyzed the dynamics of ES with cumulative step-size adaptation and of ES with self-adaptation and multi-recombination on the ellipsoid model and derived the quadratic progress rate. Akimoto et al. [17] investigated evolution strategies with weighted recombination on general convex quadratic functions and derived the asymptotic quality gain. However, Auger and Hansen [18] argued the limits of predictions based on the progress rate.
Auger and Hansen [19] developed the theory of ES from a new perspective using the stability of Markov chains. Auger [10] investigated a self-adaptive ES on the sphere function and proved its convergence based on Foster–Lyapunov drift conditions. Jebalia et al. [20] investigated the convergence rate of the scale-invariant (1+1)-ES in minimizing a noisy sphere function and proved a log-linear convergence rate, in the sense that $\frac{1}{t}\ln\|x_t\|$ converges to a constant as $t \to \infty$. Auger and Hansen [11] further investigated comparison-based step-size adaptive randomized search on scaling-invariant objective functions and proved log-linear convergence of the same type. This log-linear convergence is an extension of the average rate of convergence of deterministic iterative methods [21].
He, Kang and Ding [8, 22] studied convergence in distribution, that is, the convergence of the probability distribution of the population at generation $t$ towards a stationary probability distribution. Based on the Doeblin condition, they obtained bounds on the distance between the two distributions. He and Yu [9] also derived lower and upper bounds on the convergence rate in terms of the probability of entering a neighbourhood of the optimal set.

This paper develops Rudolph's early work [6], which showed the geometric convergence of the approximation error but did not provide a method to quantify the convergence rate. We take the ACR as a practical metric to measure geometric convergence and make a rigorous analysis.
III Definitions and Practical Usage
III-A Definitions
A continuous minimization problem is to

(2)  $\min f(x), \quad x \in D,$

where $f(x)$ is a continuous function defined on a closed set $D \subseteq \mathbb{R}^n$. Denote its minimal value by $f^*$. We assume that the optimal solution set of the above problem is a finite set.
An individual $x$ is a vector in $D$ and a population $X$ is a vector in $D^N$, where $N$ is the population size. A general framework of elitist EAs for solving optimization problems is described in Algorithm 1. Two types of genetic operators are employed in the algorithm. One is the generation operator, which generates new individuals from a population, such as mutation or crossover. The other is the selection operator, which selects individuals from a population. Any non-elitist EA can be modified into an equivalent elitist EA by adding an archive individual which preserves the best found solution but does not get involved in evolution. Hereafter we only consider elitist EAs.

Since the population $\Phi_{t+1}$ in Algorithm 1 only depends on $\Phi_t$, the population sequence $\{\Phi_t,\; t = 0, 1, \dots\}$ is a Markov chain [8, 9].
Definition 1
The fitness of population $\Phi_t$ is $f(\Phi_t) = \min\{f(x) : x \in \Phi_t\}$ and the approximation error of $\Phi_t$ is $e(\Phi_t) = f(\Phi_t) - f^*$. The sequence $\{e(\Phi_t)\}$ is called convergent in mean if $\lim_{t\to\infty} E[e(\Phi_t)] = 0$ and convergent almost surely if $\Pr\left(\lim_{t\to\infty} e(\Phi_t) = 0\right) = 1$.
Thanks to elitist selection, $e(\Phi_{t+1}) \le e(\Phi_t)$. Then the sequence $\{e(\Phi_t)\}$ is a supermartingale. According to Doob's convergence theorem [23], for elitist EAs, convergence in mean implies almost sure convergence [6].
Lemma 1
For elitist EAs, if the sequence $\{e(\Phi_t)\}$ converges in mean, then it converges almost surely.
The ACR evaluates the average convergence speed of EAs over $t$ consecutive generations [13]. The following definition is applicable to both elitist and non-elitist EAs.
Definition 2
Let $e_t = E[e(\Phi_t)]$ and assume $e_0 > 0$. The geometric average convergence rate (ACR) of an EA for $t$ generations is

(3)  $R_t = 1 - \left(\frac{e_t}{e_0}\right)^{1/t}.$

If $e_t = 0$ for some $t$, let $R_{t'} = 1$ for any $t' \ge t$.
In (3), the term $(e_t/e_0)^{1/t}$ represents the geometric average of the one-generation reduction factors $e_k/e_{k-1}$ over $t$ generations, and subtracting it from 1 normalizes the average so that larger values mean faster convergence. The ACR can be regarded as the speed of convergence, while the error $e_t$ is the distance from the optimal set. If $e_t < e_0$, then the speed is positive and $R_t > 0$; if $e_t = e_0$, then the speed is zero and $R_t = 0$; if $e_t > e_0$ (which never happens in elitist EAs), then the speed is negative and $R_t < 0$. Like the speed of light, the speed of convergence has an upper limit, that is, $R_t \le 1$.
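As a small illustration of Definition 2, the ACR can be computed directly from a sequence of approximation errors. This is a minimal sketch; the function name and the example sequence are illustrative, not taken from the paper:

```python
def acr(errors):
    """Geometric average convergence rate R_t = 1 - (e_t / e_0) ** (1 / t),
    computed for t = 1, ..., T from an error sequence [e_0, e_1, ..., e_T].
    Following Definition 2, once e_t = 0 the rate is set to 1 afterwards."""
    e0 = errors[0]
    rates, hit_zero = [], False
    for t, et in enumerate(errors[1:], start=1):
        hit_zero = hit_zero or et == 0.0
        rates.append(1.0 if hit_zero else 1.0 - (et / e0) ** (1.0 / t))
    return rates

# A geometrically decreasing error e_t = e_0 * 0.5**t yields a constant ACR of 0.5.
print(acr([1.0, 0.5, 0.25, 0.125]))
```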
III-B Practical Usage of Average Convergence Rate
The ACR provides a simple method to numerically measure how fast an EA converges. This is the main purpose of the ACR. In practice, the expected value $e_t$ is replaced by a sample mean of the approximation error over $K$ runs of the EA. The ACR is calculated in four steps [13]:

1) run an EA for $K$ times;

2) calculate the fitness sample mean $\bar{f}_t$:

(4)  $\bar{f}_t = \frac{1}{K}\sum_{k=1}^{K} f^{[k]}_t,$  where $f^{[k]}_t$ denotes the best fitness at generation $t$ in the $k$th run;

3) calculate the approximation error: $\bar{e}_t = \bar{f}_t - f^*$;

4) finally, calculate the ACR: $\bar{R}_t = 1 - (\bar{e}_t/\bar{e}_0)^{1/t}$.

According to the Law of Large Numbers, it holds that $\bar{e}_t \to e_t$ and $\bar{R}_t \to R_t$ as $K \to \infty$.

An example is given to show the usage of the ACR in computer simulation. The aim is a comparison of two EAs on two benchmark functions in terms of the ACR. The benchmarks are the 2-dimensional sphere and Rastrigin functions:
(5)  $f_1(x) = x_1^2 + x_2^2,$

(6)  $f_2(x) = 20 + x_1^2 + x_2^2 - 10\left(\cos(2\pi x_1) + \cos(2\pi x_2)\right).$
The minimal point of both functions is $x^* = (0, 0)$ with $f^* = 0$. Two EAs are variants of the (1+1) elitist EA (Algorithm 2) which adopt Gaussian mutation:

(7)  $x' = x + z,$

where $x$ is the parent, $x'$ the child and $z$ a Gaussian random vector obeying the probability distribution

(8)  $p(z) = \frac{1}{2\pi\sigma^2}\exp\left(-\frac{\|z\|^2}{2\sigma^2}\right).$

There are two ways to set the variance $\sigma^2$.
Invariant: $\sigma$ is set to the same constant for all parents $x$.

Adaptive: $\sigma$ takes varied values for different parents $x$.

For the sake of brevity, the EA using the invariant mutation is called the invariant EA and the EA using the adaptive mutation is called the adaptive EA.
In the experiment, the initial solution is fixed, each EA is run $K$ times, and each run lasts a fixed maximum number of generations.
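The comparison in this section can be reproduced with a short script that follows the four-step procedure above for the (1+1) elitist EA on the 2D sphere function. This is a minimal sketch rather than the paper's original code: the invariant setting $\sigma = 1$, the adaptive rule $\sigma = \|x\|$, the initial solution and the run counts are illustrative choices.

```python
import math
import random

def sphere(x):
    return x[0] ** 2 + x[1] ** 2

def one_plus_one_ea(sigma_rule, x0, generations, rng):
    """(1+1) elitist EA with Gaussian mutation x' = x + N(0, sigma^2 I).
    Returns the best-so-far fitness f_t for t = 0, ..., generations."""
    x, fx = list(x0), sphere(x0)
    history = [fx]
    for _ in range(generations):
        sigma = sigma_rule(x)
        child = [xi + rng.gauss(0.0, sigma) for xi in x]
        fc = sphere(child)
        if fc < fx:  # elitist selection keeps the better of parent and child
            x, fx = child, fc
        history.append(fx)
    return history

def average_acr(sigma_rule, runs=100, generations=100):
    """Four-step ACR estimate: run K times, average f_t over the runs,
    form the error (f* = 0 for the sphere), and compute R_t at the end."""
    rng = random.Random(1)
    mean_f = [0.0] * (generations + 1)
    for _ in range(runs):
        hist = one_plus_one_ea(sigma_rule, (2.0, 2.0), generations, rng)
        mean_f = [m + f / runs for m, f in zip(mean_f, hist)]
    return 1.0 - (mean_f[-1] / mean_f[0]) ** (1.0 / generations)

invariant = average_acr(lambda x: 1.0)                    # invariant: constant sigma
adaptive = average_acr(lambda x: math.hypot(x[0], x[1]))  # adaptive: sigma = ||x||
print(invariant, adaptive)  # the adaptive EA attains a much larger ACR
```

The qualitative outcome matches the analysis of Section IV: the invariant EA's ACR decays towards 0 while the adaptive EA's ACR stabilizes at a positive value.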
The ACR quantifies the speed of convergence. Table I shows that the ACR value of the adaptive EA is much larger than that of the invariant EA on both $f_1$ and $f_2$.
generation          1     101   201   301   401
$f_1$: adaptive     0.28  0.42  0.45  0.46  0.51
$f_1$: invariant    0.10  0.11  0.07  0.05  0.04
$f_2$: adaptive     0.23  0.20  1.00  1.00  1.00
$f_2$: invariant    0.04  0.05  0.04  0.03  0.02
Fig. 1 illustrates the trend of $R_t$. The ACR of the adaptive EA tends to stabilize at some positive value, while the ACR of the invariant EA shows a decreasing tendency. This phenomenon will be rigorously analyzed later.
III-C Discussion of Other Convergence Metrics
A good convergence metric should satisfy two requirements: being feasible in calculation and rigorous in analysis. We discuss two common convergence metrics and show that they do not satisfy these requirements.
The ratio $e_{t+1}/e_t$ is a popular convergence metric in deterministic iterative algorithms, which quantifies the reduction of the error in one iteration. Fig. 2 illustrates the value of $e_{t+1}/e_t$ for the adaptive EA on $f_1$. The ratio fluctuates greatly. Its calculation is sensitive and unstable because the denominator $e_t$ tends to 0. Therefore, it is not a practical metric to measure the convergence rate of EAs.
The logarithmic scale, $\ln e_t$, is probably the most widely used convergence metric for comparing the convergence speed of EAs in practice. Fig. 3 displays the value of $\ln e_t$ for the adaptive and invariant EAs on $f_1$. When using $\ln e_t$ to compare the speed of convergence of two EAs, it is necessary to visualize $\ln e_t$ in a figure and compare the slopes of the curves by observation. Fig. 3 shows that the slope of $\ln e_t$ for the adaptive EA is steeper than that for the invariant EA. However, an observation is an observation, not an analysis. The slope itself might be taken as a convergence metric, but like $e_{t+1}/e_t$, its calculation is sensitive and unstable in computer simulation.
Summarizing the above discussion, we conclude that neither the ratio $e_{t+1}/e_t$ nor the slope of $\ln e_t$ is appropriate as a convergence metric.
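The instability of $e_{t+1}/e_t$, in contrast to the stability of the ACR, can be demonstrated on a synthetic error sequence with geometric decay and multiplicative noise (the sequence and its constants are illustrative, not taken from the experiments above):

```python
import random

rng = random.Random(0)

# Synthetic approximation errors: geometric decay e_t = 0.8**t with
# multiplicative noise, mimicking run-to-run fluctuations of an EA.
errors = [1.0]
for t in range(1, 51):
    errors.append(0.8 ** t * rng.uniform(0.2, 1.8))

# One-generation ratio e_{t+1}/e_t versus the ACR R_t = 1 - (e_t/e_0)**(1/t).
ratios = [errors[t + 1] / errors[t] for t in range(len(errors) - 1)]
acr = [1.0 - (errors[t] / errors[0]) ** (1.0 / t) for t in range(1, len(errors))]

spread = lambda xs: max(xs) - min(xs)
print(spread(ratios[10:]), spread(acr[10:]))  # the ratio fluctuates far more
```

Because the $t$th root averages the noise away, the ACR settles near the underlying rate $1 - 0.8 = 0.2$ while the one-generation ratio keeps jumping.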
IV General Analyses
IV-A Transition Probabilities
An EA is determined by its operators: generation and selection. In mathematics, both can be represented by transition probabilities.
Let $S$ denote the set consisting of all populations. A population is represented by a capital letter such as $X$; the $t$th generation population is represented by $\Phi_t$, which is a random vector. A population $X$ satisfying $f(X) = f^*$ is called an optimal population, and the collection of all optimal populations is denoted by $S^*$.
Given a contraction factor $\gamma \in (0, 1)$ and a population $X$, the set $S$ can be divided into two disjoint subsets:

(9)  $S_\gamma(X) = \{Y \in S : e(Y) \le \gamma\, e(X)\},$

(10)  $\bar{S}_\gamma(X) = \{Y \in S : e(Y) > \gamma\, e(X)\}.$

The set $S_\gamma(X)$ is called a $\gamma$-promising region and especially, when $\gamma = 1$, the set $S_1(X)$ is called a promising region.
The generation of a new population from $X$ via the generation operator can be characterized by a probability transition. Given a population $X$ and a population set $A$, the transition probability kernel of the generation operator is defined as

$P_g(X, A) = \int_A p_g(X, Y)\,\mathrm{d}Y,$

where $p_g(X, Y)$ is a transition probability density function [24]. Similarly, the selection operation can be described by a probability transition too. Given any population $X$ and a population set $A$, its transition probability kernel is defined as

$P_s(X, A) = \int_A p_s(X, Y)\,\mathrm{d}Y,$

where $p_s(X, Y)$ is a transition probability density function. A one-generation update of the population, from $\Phi_t$ to $\Phi_{t+1}$, is described by the composition of the two transitions. Given any population $X$ and a population set $A$, its transition probability kernel is defined as

$P(X, A) = \int_A p(X, Y)\,\mathrm{d}Y,$

where $p(X, Y)$ is a transition probability density function.
Generally, the operators for generating new individuals may be classified into two categories.
Definition 3
Let $p_g(X, Y)$ be the probability density function depicting the generation transition from $X$ to $Y$.

Landscape-invariant: a generator is called landscape-invariant if $p_g(X, Y) = p(Z)$, where $Z = Y - X$ is a multivariate random variable whose joint probability distribution is independent of $X$. We assume the density function $p(Z)$ is continuous and bounded, as for Cauchy and Gaussian distributions.

Landscape-adaptive: otherwise, a generator is called landscape-adaptive.
A landscape-invariant generator generates candidate solutions subject to the same probability distribution no matter where a parent population is located. An example is the invariant Gaussian mutation described in Algorithm 2. A landscape-adaptive generator adjusts the probability distribution according to the position of the parent population. An example is the adaptive Gaussian mutation in Algorithm 2.
For landscape-invariant generators, the lemma below states that the infimum of the transition probability to the promising region equals zero.
Lemma 2
If the number of optimal solutions is finite and the generator is landscape-invariant, then the transition probability to the promising region satisfies

(11)  $\inf\{P_g(X, S_\gamma(X)) : X \in S \setminus S^*\} = 0,$

where $\inf$ is the abbreviation of the mathematical infimum.
Proof:
For a Lebesgue-measurable set $A$, let $m(A)$ denote its Lebesgue measure. Because $p(Z)$ is a continuous and bounded function, the probability of the mutation vector $Z$ falling in a small area is small (here the parent $X$ is fixed but $Z$ is random). More strictly, for any $\varepsilon > 0$, there exists a $\delta > 0$ such that for every measurable set $A$ with $m(A) < \delta$, it holds that

(14)  $\int_A p(Z)\,\mathrm{d}Z < \varepsilon.$

Because the number of optimal solutions is finite (and then $m(S^*) = 0$) and $f$ is continuous, the measure of the promising region $S_\gamma(X)$ shrinks as $X$ approaches the optimal set: we may choose $X$ sufficiently close to $S^*$ so that $m(S_\gamma(X)) < \delta$.

According to (14) and $m(S_\gamma(X)) < \delta$, we have

(15)  $P_g(X, S_\gamma(X)) = \int_{S_\gamma(X)} p(Y - X)\,\mathrm{d}Y < \varepsilon.$

Because $\varepsilon$ is arbitrary and $P_g \ge 0$, we have

(16)  $\inf\{P_g(X, S_\gamma(X)) : X \in S \setminus S^*\} = 0.$

The above equality is our wanted result.
IV-B Analysis of Landscape-invariant Generators
For elitist EAs using landscape-invariant generators, Theorem 1 below indicates that the limit of the ACR is 0.
Theorem 1

If an elitist EA adopts a landscape-invariant generator and the sequence $\{e(\Phi_t)\}$ converges in mean to 0, then $\lim_{t\to\infty} R_t = 0$.
Proof:
In order to prove $\lim_{t\to\infty} R_t = 0$, it is sufficient to prove that $\lim_{t\to\infty}(e_t/e_0)^{1/t} = 1$. Thanks to elitist selection, $(e_t/e_0)^{1/t} \le 1$. According to the definition of the limit, it is then sufficient to prove that for any $\varepsilon' \in (0, 1)$, there exists a $T'$ such that for all $t > T'$,

(17)  $\left(\frac{e_t}{e_0}\right)^{1/t} > 1 - \varepsilon'.$

Choose $\gamma \in (0,1)$ and $\varepsilon \in (0,1)$ such that $\gamma(1-\varepsilon) > 1 - \varepsilon'$. Since the EA converges in mean (and then almost surely, by Lemma 1) to the zero-measure optimal set and the generator is landscape-invariant, the argument of Lemma 2 yields a $T$ such that for all $t > T$, the transition probability to the promising region satisfies

(18)  $P_g(\Phi_t, S_\gamma(\Phi_t)) < \varepsilon.$

For the non-promising region $\bar{S}_\gamma(\Phi_t)$, it holds that

(19)  $e(Y) > \gamma\, e(\Phi_t), \quad \forall\, Y \in \bar{S}_\gamma(\Phi_t).$

From (18) we know that the child falls outside the promising region with probability at least $1 - \varepsilon$. Then we obtain, for all $t > T$,

(20)  $e_{t+1} \ge \gamma(1-\varepsilon)\, e_t.$

While $t \to \infty$, we know that

(21)  $\left(\frac{e_t}{e_0}\right)^{1/t} \ge \left(\frac{e_T}{e_0}\right)^{1/t}\left[\gamma(1-\varepsilon)\right]^{(t-T)/t} \to \gamma(1-\varepsilon) > 1 - \varepsilon',$

which implies (17) for all sufficiently large $t$. This completes the proof.
Theorem 1 states that for EAs using landscape-invariant generators, the limit of their ACR is 0 as $t \to \infty$. This implies that landscape-invariant generators are not appropriate for solving continuous optimization problems.
Theorem 1 may not hold if the Lebesgue measure of the optimal set $S^*$ is positive. However, for most continuous optimization problems, $S^*$ is a zero-measure set.
IV-C Analysis of Landscape-adaptive Generators
Landscape-adaptive generators can be split into two types:

Positive-adaptive: a landscape-adaptive generator is called positive-adaptive if, for some $\gamma \in (0, 1)$, the transition probability to the promising region satisfies

(22)  $\inf\{P_g(X, S_\gamma(X)) : X \in S \setminus S^*\} > 0.$

Zero-adaptive: a landscape-adaptive generator is called zero-adaptive if, for any $\gamma \in (0, 1)$, the transition probability to the promising region satisfies

(23)  $\inf\{P_g(X, S_\gamma(X)) : X \in S \setminus S^*\} = 0.$
A zero-adaptive generator represents bad adaptation because it leads to a zero-valued ACR. Condition (23) includes two cases:

The infimum is 0 but $P_g(X, S_\gamma(X)) > 0$ for all $X$. The analysis of this case is similar to Theorem 1; then $\lim_{t\to\infty} R_t = 0$.

There exists an $X$ such that $P_g(X, S_\gamma(X)) = 0$. When an EA starts from such an $X$, the error cannot be reduced below $\gamma\, e(X)$, and then $\lim_{t\to\infty} R_t = 0$.
However, a positive-adaptive generator represents good adaptation because it ensures that the limit of the ACR is positive.
Theorem 2

If an elitist EA adopts a positive-adaptive generator, then its ACR satisfies $\liminf_{t\to\infty} R_t > 0$.
Proof:

Since the generator is positive-adaptive, there exist a $\gamma \in (0, 1)$ and a $p > 0$ such that $P_g(X, S_\gamma(X)) \ge p$ for all $X \in S \setminus S^*$. With probability at least $p$, the child falls in the promising region and the error is reduced to at most $\gamma\, e(\Phi_t)$; otherwise, due to elitist selection, the error does not increase. Then

$E[e(\Phi_{t+1}) \mid \Phi_t] \le p\,\gamma\, e(\Phi_t) + (1 - p)\, e(\Phi_t) = \left[1 - p(1-\gamma)\right] e(\Phi_t).$

Then,

$e_t \le \left[1 - p(1-\gamma)\right]^t e_0.$

Let $c = 1 - p(1-\gamma) \in (0, 1)$. It holds that

$R_t = 1 - \left(\frac{e_t}{e_0}\right)^{1/t} \ge 1 - c = p(1-\gamma) > 0.$
Theorem 2 indicates that if an EA employs a positive-adaptive generator, then it converges to the optimal set with a positive ACR. How to design a generator satisfying the positive-adaptive condition (22) is important. An example is Rechenberg's 1/5th success rule for controlling the mutation strength in evolution strategies [25]. From a theoretical viewpoint, Theorems 1 and 2 together confirm the necessity of using adaptive generators in continuous optimization.
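Rechenberg's 1/5th success rule can be sketched as follows: the mutation strength grows when more than 1/5 of recent mutations succeed and shrinks otherwise. The window length, the adjustment factor 0.85 and the test setup below are conventional illustrative choices, not values prescribed by this paper.

```python
import random

def one_fifth_rule_ea(f, x0, sigma0=1.0, generations=500, window=10, seed=0):
    """(1+1)-ES whose mutation strength sigma follows the 1/5th success rule:
    every `window` generations, sigma grows if the success rate exceeded 1/5
    and shrinks if it fell below (0.85 is a conventional adjustment factor)."""
    rng = random.Random(seed)
    x, fx, sigma = list(x0), f(x0), sigma0
    successes = 0
    for t in range(1, generations + 1):
        child = [xi + rng.gauss(0.0, sigma) for xi in x]
        fc = f(child)
        if fc < fx:  # elitist selection
            x, fx, successes = child, fc, successes + 1
        if t % window == 0:
            rate = successes / window
            if rate > 0.2:
                sigma /= 0.85
            elif rate < 0.2:
                sigma *= 0.85
            successes = 0
    return x, fx

sphere = lambda x: sum(xi * xi for xi in x)
x_best, f_best = one_fifth_rule_ea(sphere, [5.0, 5.0])
print(f_best)  # far below the initial fitness of 50
```

The rule keeps the ratio of step size to distance from the optimum roughly constant, which is exactly the kind of behaviour the positive-adaptive condition (22) asks for.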
IV-D Analysis of Elitist EAs Not Convergent in Mean to 0
The analysis of this kind of EA is rather simple. The theorem below states that the limit of the ACR is 0.
Theorem 3
If the sequence $\{e_t\}$ does not converge to 0, then $\lim_{t\to\infty} R_t = 0$.

Proof:

Due to elitist selection, the sequence $\{e_t\}$ is monotonically decreasing and bounded below by 0. According to the monotone convergence theorem, it has a limit $e_\infty$. The condition says $e_\infty \neq 0$, thus $e_\infty > 0$. Then

$\lim_{t\to\infty}\left(\frac{e_t}{e_0}\right)^{1/t} = \lim_{t\to\infty}\left(\frac{e_\infty}{e_0}\right)^{1/t} = 1,$

and hence $\lim_{t\to\infty} R_t = 0$.
V Case Studies
V-A 2D Sphere Function
Consider minimization of the 2-dimensional (2D) sphere function:

(26)  $\min f_1(x) = x_1^2 + x_2^2.$

The optimal solution is $x^* = (0, 0)$ with $f^* = 0$.
The (1+1) elitist EA (Algorithm 2) is used to solve this problem. Let $x_t$ be the individual at the $t$th generation and $x'$ its child generated by the Gaussian mutation (7).
Since the mutation obeys the Gaussian probability distribution (8), its probability density function is

(27)  $p_g(x_t, x') = \frac{1}{2\pi\sigma^2}\exp\left(-\frac{\|x' - x_t\|^2}{2\sigma^2}\right).$

Recalling that the sphere function is symmetric about the origin of coordinates, without loss of generality we set the parent at $x_t = (r, 0)$, where $r = \|x_t\|$.
Since the selection is elitist, the parent can be replaced by a child only if the child falls in the promising region $S(x_t)$. For problem (26), the promising region is the circle centred at $x^* = (0, 0)$ with radius $r$. So,

(28)  $P_g(x_t, S(x_t)) = \int_{\|y\| < r} \frac{1}{2\pi\sigma^2}\exp\left(-\frac{\|y - x_t\|^2}{2\sigma^2}\right)\mathrm{d}y.$
If $\sigma$ is a constant, then the mutation is landscape-invariant. When the (1+1) EA converges to the optimal solution, the radius $r$ converges to 0. As a result, the value of (28) also converges to 0 since $\sigma$ is a constant. This means that (11) in Lemma 2 is true. According to Theorem 1, $R_t$ converges to 0 when $t \to \infty$.
In order to obtain a positive ACR, the generator should be positive-adaptive, that is, there exists a $p > 0$ such that, for all $x_t \neq x^*$,

$P_g(x_t, S(x_t)) \ge p.$

In order to ensure a positive lower bound on (28), we choose an adaptive $\sigma$. Denote $c = \sigma/r$. From (28), the change of variables $y = r u$ gives

$P_g(x_t, S(x_t)) = \int_{\|u\| < 1} \frac{1}{2\pi c^2}\exp\left(-\frac{\|u - (1, 0)\|^2}{2c^2}\right)\mathrm{d}u,$

which depends only on the ratio $c = \sigma/r$. Take this probability as a function of $c$ defined on a closed interval $[c_1, c_2] \subset (0, +\infty)$. Obviously the function is continuous and positive. That is, there exists a $p > 0$ such that the probability is at least $p$ for all $c$ in $[c_1, c_2]$. Setting $\sigma = c\, r$ with $c \in [c_1, c_2]$, we know that the generator is positive-adaptive.
For any adaptive $\sigma$ such that $\sigma/\|x_t\|$ stays in a fixed interval $[c_1, c_2]$, according to Theorem 2, the limit of $R_t$ is positive. A simple implementation is to let $\sigma$ be proportional to $\|x_t\|$, which is the adaptive setting used in Section III-B.
This case study shows the applicability of our theory to unimodal functions and confirms the importance of using an adaptive $\sigma$ even for the sphere function. Moreover, practical EAs such as evolutionary programming and evolution strategies always adopt an adaptive $\sigma$ for a faster convergence speed.
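The contrast between the invariant and the adaptive $\sigma$ can also be checked numerically: with $\sigma$ proportional to $\|x_t\|$, the probability (28) of mutating into the promising region stays bounded away from zero at every scale, whereas with a constant $\sigma$ it vanishes as the parent approaches $x^*$. A Monte Carlo sketch (the proportionality constant 1 is an illustrative choice):

```python
import random

def hit_probability(x, sigma, trials=100_000, seed=0):
    """Monte Carlo estimate of the probability that the Gaussian mutation
    x' = x + N(0, sigma^2 I) lands in the promising region {y : ||y|| < ||x||}
    of the 2-D sphere function."""
    rng = random.Random(seed)
    r2 = x[0] ** 2 + x[1] ** 2
    hits = 0
    for _ in range(trials):
        y1 = x[0] + rng.gauss(0.0, sigma)
        y2 = x[1] + rng.gauss(0.0, sigma)
        if y1 * y1 + y2 * y2 < r2:
            hits += 1
    return hits / trials

# With invariant sigma the hit probability vanishes as the parent nears x* = 0;
# with adaptive sigma = ||x|| it stays roughly constant at every scale.
for r in (1.0, 0.1, 0.01):
    print(r, hit_probability((r, 0.0), sigma=1.0), hit_probability((r, 0.0), sigma=r))
```

The scale invariance of the adaptive case mirrors the change of variables $y = r u$ in the derivation above: the probability depends only on the ratio $\sigma/r$.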
V-B 2D Rastrigin Function
Consider minimization of the 2D Rastrigin function:

(29)  $\min f_2(x) = 20 + x_1^2 + x_2^2 - 10\left(\cos(2\pi x_1) + \cos(2\pi x_2)\right),$

where $x = (x_1, x_2)$. The optimal solution is $x^* = (0, 0)$ with $f^* = 0$. The 2D function is a sum of two 1D Rastrigin functions:

(30)  $f_2(x) = g(x_1) + g(x_2),$

where $g(u) = 10 + u^2 - 10\cos(2\pi u)$.
The (1+1) elitist EA (Algorithm 2) is used to solve this minimization problem. Assume that $x_t$ is the parent at the $t$th generation, at a given fitness level. Since the selection is elitist, the parent is replaced by a child only if the child falls in the promising region.
Fig. (a) shows the fitness landscape of the 2D Rastrigin function. Fig. (b) illustrates the projection of the landscape at a fitness level onto the decision plane.
Consider the partial derivatives

(31)  $\frac{\partial f_2}{\partial x_i} = 2 x_i + 20\pi \sin(2\pi x_i), \quad i = 1, 2.$

Because $\sin(2\pi x_i)$ is a periodic function with values restricted in $[-1, 1]$, all solutions to the equation $\partial f_2/\partial x_i = 0$ are located in $[-10\pi, 10\pi]$. So, the 2D Rastrigin function has only finitely many global/local optimal solutions. Then the promising region is decomposed into finitely many mutually disjoint subsets (let $J$ denote the number of subsets):