In many practical optimization problems, the objective functions are hidden or too complicated to be analyzed. Under such circumstances, direct optimization algorithms, which follow a trial-and-error style guided by heuristics, are appealing. Evolutionary algorithms (EAs) Bäck (1996) are a large family of such algorithms. The family includes genetic algorithms Goldberg (1989), evolutionary programming Koza (1994), and evolutionary strategies Beyer and Schwefel (2002), and also covers other nature-inspired heuristics including particle swarm optimization Kennedy and Eberhart (1995), ant colony optimization Dorigo et al. (1996), estimation of distribution algorithms Larrañaga and Lozano (2002), etc.
Theoretical studies of EAs have developed rapidly in recent decades, most noticeably in the blooming of running time analysis Neumann and Witt (2010); Auger and Doerr (2011); Jansen (2013). With the development of several analysis techniques (e.g. He and Yao (2001); Yu and Zhou (2008); Doerr et al. (2012); Sudholt (2013)), EAs have been theoretically investigated on problems ranging from simple synthetic ones (e.g. Droste et al. (2002b)) to combinatorial problems (e.g. Scharnow et al. (2002)) as well as NP-hard problems (e.g. Yu et al. (2012)). These analyses have disclosed the effects of EA components Yao (2012), including the crossover operators (e.g. Jansen and Wegener (2002); Lehre and Yao (2008); Doerr et al. (2013); Qian et al. (2013)), the population size (e.g. Jansen et al. (2005); Storch (2008); Witt (2008); Chen et al. (2009)), etc. Performance measures have also been developed to cover the approximation complexity (e.g. He and Yao (2003); Friedrich et al. (2010); Yu et al. (2012); Lai et al. (2014)), the fixed-parameter complexity (e.g. Kratsch and Neumann (2013); Sutton and Neumann (2012)), the complexity under fixed-budget computation Jansen and Zarges (2012), etc. While most of these analyses studied particular instances of EAs on particular problem cases, a general performance analysis may be even more desirable, as the application of EAs is nearly unlimited. The famous No-Free-Lunch Theorem Wolpert and Macready (1997) used a quite general framework of EAs and gave the general conclusion that any two EAs have the same performance (at least on discrete domains) given no prior knowledge of the problem distribution, in which case the general running time is exponential Yu and Zhou (2008). When the complexity of a problem class is bounded, a general convergence lower bound can be derived for a class of EAs Fournier and Teytaud (2011). For more general EAs, the Black-Box model can derive the best possible performance Droste et al. (2002a); Anil and Wiegand (2009); Lehre and Witt (2012); Doerr and Winzen (2011). We have learned that a general performance analysis relies on a general framework of EAs.
It has been noticed that various implementations of EAs share a common structure that consists of a cycle of sampling and model building Zlochin et al. (2004). In this work, we propose to study the sampling-and-learning (SAL) framework. EAs commonly employ some heuristic to reproduce solutions, which is captured by the sampling step of SAL; and they also distinguish the quality of the reproduced solutions to guide the next sampling (e.g., genetic algorithms remove a portion of the worst solutions), which is captured by the learning step of SAL. The SAL framework can simulate a wide range of EAs as well as other heuristic search methods, by specifying the sampling and the learning strategies.
We evaluate this framework by the probable-absolute-approximate (PAA) query complexity. PAA complexity counts the number of fitness evaluations needed before reaching an approximate solution with a given probability, which is close to the intuitive evaluation of EAs in practice. We show that the SAL framework immediately admits a general PAA upper bound. For a specific version of SAL that uses classification algorithms, named the SAC algorithms, we obtain a tighter PAA upper bound by incorporating results from learning theory. Comparing further with the uniform random search, we disclose that, under the error-target independence condition, SAC algorithms can reduce the complexity of the uniform search polynomially, but not super-polynomially, while the one-side-error condition further allows a super-polynomial improvement. This study shows that the classification error is an important factor affecting performance, which was not noticed before. We also notice that a good learning algorithm may not be necessary for a good SAL algorithm.
The rest of this paper is organized as follows. Section 2 introduces the SAL framework. In Section 3, we compare the SAC algorithms, a specific version of the SAL framework, with the uniform search. Finally, Section 4 concludes the paper.
2 The Sampling-and-Learning Framework
In this paper, we consider general minimization problems . We always denote as the whole solution space over which an algorithm searches. In the analysis of this paper, we assume that is a compact set (in Euclidean space, a set is compact if and only if it is bounded and closed) and that is a continuous function. Thus there must exist at least one solution such that . We use to denote sub-regions of and define . For convenience of analysis, we assume without loss of generality that since is a bounded and closed set. Denote for any scalar ,
as the uniform distribution over , and
as the probability distributions. Besides, by , we mean the set of all polynomials in the related variables, and by , we mean the set of all functions that grow faster than any function in with the related variables.
[Minimization Problem] A minimization problem consists of a continuous solution space and a continuous function , where and is a compact set. The goal is to find a solution such that for all .
Since is a compact set and is a continuous function, there must exist one solution such that . Namely, is bounded in . Therefore, in the rest of the paper, we assume without loss of generality that the value of is bounded in , i.e., . Given an arbitrary function with bounded value range over the input domain, the bound can be implemented by a simple normalization . Thus we assume in the rest of this paper that every minimization problem has its minimum value .
In real-world applications, we expect EAs to find good enough solutions with a probability that is not too small, which corresponds to approximation (e.g. Yu et al. (2012)) and probabilistic performance (e.g. Zhou et al. ). Combining the two, we study the probable-absolute-approximate (PAA) query complexity, which is the number of fitness evaluations that an algorithm takes before reaching a solution of approximate quality, as defined in Definition 2. The PAA query complexity closely reflects our intuitive evaluation of EAs in practice.
[Probable-Absolute-Approximate Query Complexity] Given a minimization problem , an algorithm , and any as well as any approximation level , then the probable-absolute-approximate (PAA) query complexity is the number of calls to such that, with probability at least , finds a solution with .
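As a rough illustration of this definition, the sketch below estimates empirically how often an algorithm reaches a given approximation level within a fixed query budget. It is only a sketch under stated assumptions: the solution space is taken to be the unit cube, the minimum value is taken to be 0, the algorithm is plain uniform sampling, and the names `uniform_search`, `paa_success_rate`, and `toy_objective` are hypothetical and do not come from the paper.

```python
import random

def uniform_search(f, dim, budget, rng):
    """Query f at `budget` points drawn uniformly from [0, 1]^dim; return the best value."""
    best = float("inf")
    for _ in range(budget):
        x = [rng.random() for _ in range(dim)]
        best = min(best, f(x))
    return best

def paa_success_rate(f, dim, budget, eps, runs=200, seed=0):
    """Fraction of independent runs whose best value is at most eps (minimum assumed to be 0)."""
    rng = random.Random(seed)
    hits = sum(uniform_search(f, dim, budget, rng) <= eps for _ in range(runs))
    return hits / runs

def toy_objective(x):
    # Hypothetical objective with minimum value 0 at x = (0.5, ..., 0.5).
    return sum((xi - 0.5) ** 2 for xi in x) / len(x)
```

Under these assumptions, the smallest budget for which the estimated rate stays above the required confidence gives an empirical counterpart of the PAA query complexity for that approximation level.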
2.1 The General Framework
Most EAs share a common trial-and-error structure with several important properties:
directly access the solution space, generate solutions, and evaluate the solutions;
the generation of new solutions depends only on a short history of past solutions;
both “global” and “local” heuristic operators are employed to generate new solutions.
We present a sampling-and-learning (SAL) framework in Algorithm 1 to capture these properties. The SAL framework starts from random sampling in Step 1, like all EAs. Steps 2 and 13 record the best-so-far solutions throughout the search. SAL follows a cycle of learning and sampling stages. In Step 7, it learns a hypothesis (i.e., a mapping from to ) via the learning algorithm . Note that the learning algorithm is allowed to take the current data set , the last data set , and the last hypothesis into account. Different EAs may make different use of them. Step 8 initializes the sample set for the next iteration. The sample set can be initialized as an empty set, or it can preserve some good solutions from the previous iteration. In Steps 9 to 12, it samples from the distribution transformed from the hypothesis as well as from the whole solution space, with the two sources balanced by a probability. The distribution reflects the potentially good regions learned by .
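The cycle described in this paragraph can be sketched in a few lines. This is not Algorithm 1 itself but a minimal illustration under assumptions: the solution space is the unit cube, the next sample set starts empty, `lam` is taken to be the probability of sampling from the learnt hypothesis rather than uniformly, and `learn_best_point` and `sample_near` are hypothetical stand-ins for a learning algorithm and a hypothesis-induced distribution.

```python
import random

def sal_minimize(f, dim, learn, sample_from, m0=20, m=20, iters=10, lam=0.5, seed=0):
    """Minimal SAL-style loop on [0, 1]^dim (a sketch, not the paper's Algorithm 1)."""
    rng = random.Random(seed)
    uniform = lambda: [rng.random() for _ in range(dim)]
    # Initial stage: uniform random sampling and evaluation.
    data = [(x, f(x)) for x in (uniform() for _ in range(m0))]
    best = min(data, key=lambda p: p[1])
    prev_data, hyp = [], None
    for _ in range(iters):
        # Learning stage: may use current data, previous data, previous hypothesis.
        hyp = learn(data, prev_data, hyp)
        prev_data, data = data, []  # the next sample set starts empty
        for _ in range(m):
            # Sampling stage: hypothesis-guided with probability lam, else uniform.
            x = sample_from(hyp, rng) if rng.random() < lam else uniform()
            data.append((x, f(x)))
        # Record the best-so-far solution.
        best = min([best] + data, key=lambda p: p[1])
    return best

def learn_best_point(data, prev_data, prev_hyp):
    # Hypothetical learner: remember the best solution of the current data set.
    return min(data, key=lambda p: p[1])[0]

def sample_near(center, rng, scale=0.1):
    # Hypothetical local sampler: Gaussian perturbation clipped to [0, 1].
    return [min(1.0, max(0.0, c + rng.gauss(0.0, scale))) for c in center]
```

With `lam=0` the loop never consults the hypothesis and reduces to uniform random search, which makes the sketch convenient for comparing guided and unguided sampling under the same budget.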
It should be noted that the SAL framework is not a concrete optimization algorithm but an abstract summary of a range of EAs, and that the learning stage of the framework does not imply accurate learning. We explain in the following how several different EAs can be mimicked by the SAL framework. Note that the explanation is not a rigorous proof, but an intuitive illustration that the SAL framework can correspond to various implementations.
The genetic algorithms (GAs) Goldberg (1989) deal with discrete solution spaces consisting of solutions represented as vectors of words from a vocabulary. The element-wise mutation operator changes every element of a solution, with some probability, to a randomly selected word from the vocabulary. Converting this operation probability to the probability of generating a certain solution, let be the probability of generating the solution from via the element-wise mutation, thus , where is the length of the solution, is the vocabulary size, is the Hamming distance, and is the probability of changing an element, which is commonly . It is easy to calculate that is only when is a constant (and otherwise ). Given any set of solutions , we divide the search space into two sets, and . SAL can simulate the GA as follows: for every population of the GA, SAL learns the hypothesis that circles the area , and uses as for solutions in . For the area , SAL uses the uniform distribution to approximate the sampling with super-polynomially small probability. In this way, SAL can mimic the behavior of the GA. We have discussed a simplified GA. Most GAs also employ crossover operators, which are a kind of local search operator, and thus the resulting distribution can be compiled into the local distribution. Many GAs also employ a probabilistic selection, which can be simulated by selecting the initial solution set in the same way.
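To make the conversion from a per-element mutation rate to a per-solution generation probability concrete, the following sketch computes that probability for one common variant of element-wise mutation. The variant is an assumption, not necessarily the paper's exact operator: each element is changed independently with probability `p` (defaulting to one over the solution length) to one of the other words, chosen uniformly. The function name `mutation_probability` is hypothetical.

```python
def mutation_probability(x, y, vocab_size, p=None):
    """Probability that element-wise mutation turns solution x into solution y.

    Assumption: each element is changed independently with probability p
    (default 1/n) to one of the other vocab_size - 1 words, chosen uniformly.
    """
    n = len(x)
    if len(y) != n:
        raise ValueError("solutions must have equal length")
    if p is None:
        p = 1.0 / n
    k = sum(a != b for a, b in zip(x, y))  # Hamming distance between x and y
    return (p / (vocab_size - 1)) ** k * (1.0 - p) ** (n - k)
```

Under this assumption the probability shrinks geometrically in the Hamming distance, which is why solutions far from the current population are reached only with very small probability.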
It has been argued that model-based search algorithms, including the estimation of distribution algorithms (EDAs) Larrañaga and Lozano (2002), the ant colony optimization algorithms (ACOs) Dorigo et al. (1996), and the cross-entropy method Rubinstein and Kroese (2004), can be unified in the sampling and model building framework Zlochin et al. (2004), whose two components correspond respectively to the sampling and learning steps in the SAL framework. The particle swarm optimization algorithms (PSOs) Kennedy and Eberhart (1995) are particularly interesting since their simulation is perhaps the most sophisticated. A PSO algorithm maintains a set of “flying” particles, each with a location (representing a solution) and a velocity vector. The location of a particle in the next iteration is determined by its current location and current velocity, and the velocity is updated using the current velocity and the locations of the “globally” and “personally” best particles. To simulate a PSO, a SAL algorithm needs to use an initial hypothesis resulting in the same sampling distribution as that from the initial velocity. Let be an ordered set containing the globally best particle and the personally best particles in Step 8. The learning algorithm in the SAL algorithm can be set to utilize the current data set and the last data set to recover the velocity, and to utilize the last hypothesis and the globally and personally best particles recorded through to generate the new hypothesis that simulates the movement of particles in the PSO.
Overall, the SAL framework captures the trial-and-error structure as well as the global–local search balance, while leaving the details of the local sampling distribution to be implemented by different heuristics.
The SAL framework directly admits a general upper bound of the PAA query complexity, as stated in Theorem 2.1.
For any minimization problem and any approximation level , with probability at least , a SAL algorithm will output a solution with using number of queried samples bounded from above by
where is the success probability of uniform sampling,
is the average success probability of sampling from the learnt hypothesis, is the required sample size realizing , and . is the initial sample size. In every iteration, we need samples to realize the probability (generally, the higher the probability, the larger the sample size, but this depends on the concrete implementation of the algorithm), thus number of samples is naturally required. We prove the rest of the bound.
Consider the probability that, after iterations, the SAL algorithm outputs a bad solution such that . Since is the best solution among all sampled examples, this is the probability of the intersection of the events that no sampling step generates a good solution.
1. For the sampling from uniform distribution over the whole solution space , the probability of failure is .
2. For the sampling from the learnt hypothesis according to the distribution , the probability of failure is denoted as .
Since every sampling is independent, we can expand the probability of overall failure, i.e., for any solution belonging to the set of all sampled examples,
where the first inequality is by for .
To ensure that , we let , which yields .
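The final step of the proof can be spelled out in a simplified setting. As an assumption for illustration only, suppose each of $m$ independent samples succeeds with probability at least $p$, and let $\delta$ denote the allowed failure probability; the symbols $m$, $p$, and $\delta$ are placeholders chosen here and need not match the paper's notation. Using $1 - x \le e^{-x}$ for $x \in [0, 1]$:

```latex
\Pr[\text{all } m \text{ samples fail}] \;\le\; (1-p)^m \;\le\; e^{-pm},
\qquad
e^{-pm} \le \delta
\;\Longleftrightarrow\;
m \;\ge\; \frac{1}{p}\,\ln\frac{1}{\delta}.
```

So a number of samples proportional to $\ln(1/\delta)$ and inversely proportional to the per-sample success probability suffices, which is the shape of the bound the proof arrives at.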
2.2 The Sampling-and-Classification Algorithms
To further unfold the unknown term in Theorem 2.1, we focus on a simplified version of the SAL framework that employs a classification algorithm in the learning stage. We call algorithms of this type the sampling-and-classification (SAC) algorithms. In the learning stage of a SAC algorithm, as described in Algorithm 2, the learning algorithm first uses a threshold to transform the data set into a binary labeled data set, and then invokes the classification algorithm to learn from the binary data set. is defined as . Note that SAC algorithms use the current data set in the learning algorithm, but not the last data set or the last hypothesis . Putting Algorithm 2 into the framework of Algorithm 1, we always set for SAC, and will be some distribution over the positive area of .
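The learning stage just described, thresholding followed by classification, might look as follows. This is a hedged sketch rather than Algorithm 2: it assumes a solution is labeled positive when its fitness is at most the threshold, and it uses a deliberately simple hypothetical classifier (the smallest axis-aligned box around the positive examples). The names `label_by_threshold`, `fit_bounding_box`, and `predict` are illustrative only.

```python
def label_by_threshold(data, alpha):
    """Turn (solution, fitness) pairs into (solution, +1/-1) pairs.

    Assumption: a solution is positive when its fitness is at most alpha.
    """
    return [(x, 1 if y <= alpha else -1) for x, y in data]

def fit_bounding_box(labeled):
    """Hypothetical classifier: smallest axis-aligned box containing all positives."""
    pos = [x for x, label in labeled if label == 1]
    if not pos:
        return None  # no positive example, so the hypothesis has an empty positive area
    lows = [min(c) for c in zip(*pos)]
    highs = [max(c) for c in zip(*pos)]
    return lows, highs

def predict(box, x):
    """Return +1 if x lies inside the learnt box, else -1."""
    if box is None:
        return -1
    lows, highs = box
    return 1 if all(l <= xi <= h for l, xi, h in zip(lows, x, highs)) else -1
```

Sampling uniformly inside the returned box would then play the role of a distribution over the positive area of the hypothesis; any other binary classifier could be substituted for the box.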
With these specifications, we can derive a general PAA performance bound for SAC algorithms. According to Theorem 2.1, we need to estimate a lower bound of , i.e., how likely the distribution is to lead to a good solution. Recall for any scalar . Denote for any hypothesis , as the uniform distribution over , and as the Kullback-Leibler (KL) divergence. The KL divergence measures how much one distribution departs from another. For probability distributions and of two continuous random variables, , where and are the probability densities of and . Let denote the symmetric difference operator of two sets. We have a lower bound of the success probability as in Lemma 2.2.
For any minimization problem , any approximation level , any hypothesis , the probability that a solution sampled from an arbitrary distribution defined on will lead to a solution in is lower bounded as
Let denote the indicator function, namely, and . The proof starts from the definition of the probability,
where the last inequality is by Pinsker’s inequality.
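Pinsker's inequality, used in the last step, bounds the total variation distance between two distributions by the square root of half their KL divergence. The sketch below checks it numerically for discrete distributions; this is a discrete analogue chosen for illustration (the paper works with continuous densities), it uses the natural logarithm, and it assumes the second distribution is positive wherever the first one is. The function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL divergence sum_i p_i * ln(p_i / q_i) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    """Total variation distance, i.e. half the L1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def pinsker_holds(p, q):
    """Check Pinsker's inequality: TV(p, q) <= sqrt(KL(p || q) / 2)."""
    return total_variation(p, q) <= math.sqrt(kl_divergence(p, q) / 2.0) + 1e-12
```

For instance, for `p = [0.5, 0.5]` and `q = [0.9, 0.1]` the total variation distance is 0.4, while the KL-based bound evaluates to roughly 0.51, so the inequality holds with some slack.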
We cannot pre-determine , but we know that is derived by a binary classification algorithm from a data set which is labeled according to the threshold parameter . For the binary classification, we know that the generalization error, which is the expected misclassification rate, can be bounded above by the training error, which is the misclassification rate in the seen examples, as well as the generalization gap involving the complexity of the hypothesis space Kearns and Vazirani (1994), as in Lemma 2. The is the VC-dimension measuring the complexity of .
[Kearns and Vazirani (1994)] Let be the hypothesis space containing a family of binary classification functions and , if there exist samples i.i.d. from according to some fixed unknown distribution , then, and , the following upper bound holds true with probability at least :
where is the expected error rate of over and is the error rate in the sampled examples from , and when ,
Again by Pinsker’s inequality, we know that the error under the distribution can be converted to the error under the uniform distribution, as
where we only take the event that the generalization inequality holds with probability into account. For simplicity, we denote the right-hand part as , which decreases with and , and increases with , , and .
We can use this inequality to eliminate the in Lemma 2.2. In every iteration of SAC algorithms, there are samples collected, which make the error of bounded.
For any minimization problem , any constant , and any approximation level , the average success probability of sampling from the learnt hypothesis of any SAC algorithm is lower bounded as
where is the sampling distribution at iteration , is the training error rate of , is the VC-dimension of the learning algorithm. By set operators,
where is the symmetric difference operator of two sets and is the expected error rate of under . The first inequality is by the triangle inequality, and the last equality holds because is contained in .
Since we can bound as
Now, we can apply Lemma 2.2, and the success probability of sampling from is lower bounded as
Substituting this lower bound and the probability of the generalization bound into yields the theorem.
Combining Theorem 2.1 and Theorem 2.2 results in an upper bound on the sampling complexity of SAC algorithms. Although the expression looks sophisticated, it still reveals the relevant variables that generally affect the complexity. One could design various distributions for to sample potential solutions; however, without any a priori knowledge, uniform sampling is the best in terms of worst-case performance. Meanwhile, without any a priori knowledge, a small training error at each stage from a learning algorithm with a small VC-dimension can also improve the performance.
3 SAC Algorithms vs. Uniform Search
When EAs are applied, we usually expect that they can achieve a better performance than some baselines. The uniform search can serve as a baseline, which searches the solution space always by sampling solutions uniformly at random. In other words, the uniform search is the SAL algorithm with . In this section, we study the performance of SAC algorithms relative to the uniform search.
SAC algorithms will degenerate to uniform search if . Thus, it is easy to know that the PAA query complexity of uniform search is
Contrasting this with Theorem 2.1, we find that how much a SAC algorithm improves on the uniform search depends on the average success probability that relies on the learnt hypothesis. A SAC algorithm is not always better than the uniform search. Without any restriction, can be zero and thus the SAC algorithm is worse. We are therefore interested in investigating the conditions under which SAC algorithms can accelerate over the uniform search.
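The dependence on the hypothesis-guided success probability can be illustrated numerically. The sketch below is a simplification made for illustration, not the paper's bound: it assumes every sample is drawn from the learnt distribution with probability `lam` and uniformly otherwise, with fixed per-sample success chances `p_hyp` and `p_uniform`, and it ignores the samples spent on learning. The function name `queries_needed` and all parameter names are hypothetical.

```python
import math

def queries_needed(p_uniform, p_hyp, lam, delta):
    """Samples sufficient for success with probability >= 1 - delta.

    Assumption: each sample is drawn from the learnt distribution with
    probability lam (success chance p_hyp) and uniformly otherwise
    (success chance p_uniform), independently of all other samples.
    """
    p = (1.0 - lam) * p_uniform + lam * p_hyp
    if p <= 0.0:
        return math.inf  # success can never be guaranteed
    return math.ceil(math.log(1.0 / delta) / p)
```

With `p_uniform = 0.01` and `delta = 0.05`, pure uniform search (`lam = 0`) needs 300 samples under this model. A hypothesis that never hits the target (`p_hyp = 0`, `lam = 0.5`) doubles that to 600, whereas a helpful one (`p_hyp = 0.2`) brings it well below 300.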
3.1 A Polynomial Acceleration Condition
[Error-Target Independence] In SAC algorithms, for any and any approximation level , when sampling a solution from , the event and the event are independent.
We call SAC algorithms that are under the error-target independence condition as SAC algorithms. The condition is defined using the independence of random variables. From the set perspective, it is equivalent to
Under the condition, we can bound from below the probability of sampling a good solution, as stated in Lemma 3.1.
For SAC algorithms, it holds for all that
where is the expected error rate of under . For the numerator,
where the first equation is by , and the second equality is by the error-target independence condition.
For the denominator, we consider the worst case that all errors are out of and thus .
Similar to Theorem 2.2, we can bound from below the average success probability of sampling from the positive area of the learnt hypothesis,
We compare the uniform search with the SAC algorithms using uniform sampling within , i.e., , which is an optimistic situation. Then by Lemma 3.1,
By plugging , where is the expected error rate of under the distribution and ,
Note from Lemma 2 that the convergence rate of the error is , ignoring other variables and logarithmic terms. We assume that SAC uses learning algorithms with convergence rate . We then find that such SAC algorithms cannot exponentially improve on the uniform search in the worst case, as stated in Proposition 3.1.
Using learning algorithms with convergence rate , and , with probability at least , if the query complexity of the uniform search is , the query complexity of SAC algorithms is also in the worst case. The query complexity of the uniform search being implies that
For the SAC algorithms, if we ask the learning algorithm to produce a classifier with error rate , it will require number of samples in the worst case, so that the proposition holds. To avoid this, we can only expect the error rate to be in order to keep the query complexity at each iteration small.
Meanwhile, we can only have iterations; otherwise we will have a super-polynomial number of samples.
Following the optimistic case of Eq.(39), since , we consider one more optimistic situation that . Let . Even so, in the worst case that , we have that
where it is noted that, as long as , the value of cannot affect the result. Then substituting into Theorem 2.1 yields the total number of samples, which proves the proposition.
The proposition implies that the SAC algorithms can face the same barrier as the uniform search. Nevertheless, the SAC algorithms can still improve on the uniform search by a polynomial factor. We show this by case studies.
On Sphere Function Class:
Given the solution space , the Sphere Function class is where
Obviously, , is convex, and the optimal value is 0. It is important to notice that the volume of a -dimensional hyper-sphere with radius is , where , so that for any , where , since the radius leading to is .
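The hyper-sphere volume referred to here can be evaluated with the standard formula $\pi^{n/2} / \Gamma(n/2 + 1) \cdot r^n$ for an $n$-dimensional Euclidean ball of radius $r$. The short sketch below implements that standard formula; the function name `ball_volume` is hypothetical, and how the radius relates to the approximation level depends on the paper's definition of the function class, which is not assumed here.

```python
import math

def ball_volume(n, r):
    """Volume of an n-dimensional Euclidean ball of radius r.

    Uses the standard formula pi^(n/2) / Gamma(n/2 + 1) * r^n.
    """
    return math.pi ** (n / 2.0) / math.gamma(n / 2.0 + 1.0) * r ** n
```

For a fixed radius below 1 the volume shrinks rapidly as the dimension grows (for radius 0.5 it is about 0.785 in two dimensions and about 0.0025 in ten), which is what makes the target region hard to hit by uniform sampling in high dimensions.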
Note that . It is straightforward to obtain that, minimizing any function in using the uniform search, the PAA query complexity with approximation level is, with probability at least ,
We assume is a learning algorithm that searches in the hypothesis space consisting of all the hyper-spheres in to find a sphere that is consistent with the training data, and meanwhile the sphere satisfies the error-target independence condition. Then a SAC algorithm using is a SAC algorithm. We simply assume that the search of the consistent sphere is feasible. Note that . For any , denote as the error rate of under the uniform distribution over and as the error rate of under the distribution , then it holds that
where and is the uniform distribution over . Let be the indicator function and be the area where makes mistakes. We split into