|CEC||Congress on Evolutionary Computation|
|DE-a||Adaptive Variant of DE|
|LHS||Latin Hypercube Sampling|
|PSO||Particle Swarm Optimization|
|PSO-w||PSO with an Inertia Weight|
Many real-world optimization problems are very complex, subject to multiple nonlinear constraints. Such nonlinearity and multimodality can cause difficulties in solving these optimization problems. Both empirical observations and numerical simulations suggest that the final solution may depend on the initial starting points for multimodal optimization problems (Yang et al., 2018; Eskandar et al., 2012). This is especially true for gradient-based methods. In addition, for problems with non-smooth objective functions and constraints, gradient information may not be available. Hence, most traditional optimization methods struggle to cope with such challenging issues. A good alternative is to use metaheuristic optimization algorithms, such as particle swarm optimization (PSO) and cuckoo search (CS). These metaheuristic optimizers are gradient-free optimizers, which do not require any prior knowledge or rigorous mathematical properties, such as continuity and smoothness (Yang et al., 2018; Li et al., 2016).
In the past decade, various studies have shown that these metaheuristic algorithms are effective in solving different types of optimization problems, including noisy and dynamic problems (Yang et al., 2018; Sun et al., 2019; Fan et al., 2018; Cheng et al., 2018). For example, engineering design problems can be solved by an improved variant of the PSO (Isiet and Gadala, 2019), and the connectivity of the internet of things (IoT) can be enhanced by a multi-swarm optimization algorithm (Hasan and Al-Rizzo, 2019). In addition, an optimized energy consumption model for smart homes can be achieved by differential evolution (DE) (Essiet et al., 2019), while optimal dam and reservoir operation can be achieved by a hybrid of the bat algorithm (BA) and PSO (Yaseen et al., 2019). A fuzzy-driven genetic algorithm (Jacob et al., 2009) was used to solve a sequence segmentation problem, and a fuzzy genetic clustering algorithm was used to solve a dataset partition problem (Nguyen and Kuo, 2019).
Almost all optimization algorithms require some form of initialization, where educated guesses or random initial solutions are generated. Ideally, the final optimal solutions found by an algorithm should be independent of these initial choices. This is only true for a few special cases such as linear programs and convex optimization; a vast majority of problems are neither linear nor convex, so such dependency can be a challenging issue. In fact, most algorithms show different degrees of dependency on their initial settings, and the actual dependency can be problem-specific and algorithm-specific (Yang, 2014; Kondamadugula and Naidu, 2016). For large-scale and multimodal problems, the effect of initialization is more pronounced, and many algorithms may show differences in the probability of finding global optima under different initializations (Elsayed et al., 2017).
However, the literature still lacks a systematic study of initialization and of how initial distributions may affect the performance of algorithms on a given set of problems. The good news is that researchers have started to realize the importance of initialization and to explore other possibilities with the aim of increasing the diversity of the initial population (Yang, 2014). For example, based on the guiding principle of covering the search space as uniformly as possible, some studies have preliminarily explored different initialization methods, including quasi-random initialization (Kimura and Matsumura, 2005; Ma and Vandenbosch, 2012; Kazimipour et al., 2014; Maaranen et al., 2004), chaotic systems (Gao and Liu, 2012; Alatas, 2010), anti-symmetric learning methods (Rahnamayan et al., 2008), and Latin hypercube sampling (Ran et al., 2017; Zaman et al., 2016). In some cases, these studies have improved the performance of algorithms such as PSO and genetic algorithms (GA), but some serious issues remain. Specifically, quasi-random initialization is simple and easy to implement, but it suffers from the curse of dimensionality (Maaranen et al., 2004). For chaos-based approaches, random sequences are generated by a few chaotic maps with few parameters (initial conditions), but such sequences are inevitably very sensitive to their initial conditions (dos Santos Coelho and Mariani, 2008). In addition, the anti-symmetric learning method maintains twice as many candidate solutions in order to select the solutions for the next generation, which doubles the computational cost. Though Latin hypercube sampling is very effective in low dimensions, its performance can deteriorate significantly for higher-dimensional problems. We will discuss this issue in more detail later in this paper.
On the other hand, some researchers have attempted to design a specific type of initialization in combination with a certain type of algorithm so as to solve a particular type of problem more efficiently. For example, Kondamadugula and Naidu (Kondamadugula and Naidu, 2016) applied knowledge-based initialization to improve the performance of the genetic algorithm for solving the traveling salesman problem; Li et al. (Li et al., 2019) used node degrees for initialization in a network disintegration problem, and Puralachetty and Pamula (Puralachetty and Pamula, 2016) proposed a two-stage initialization approach for PID controller tuning in a coupled-tank liquid system. However, these approaches do have some drawbacks. Firstly, such initialization requires sophisticated allocation of points, which may not be straightforward to implement and can thus increase the computational costs. Secondly, they may be suitable only for particular types of problems or algorithms. Thirdly, such initialization largely depends on the experience of the user. Finally, there is no mathematical guidance about the ways of initialization in practice.
This motivates us to carry out a systematic study of different initialization methods and their effects on algorithmic performance. The choice of 22 probability distributions is based on rigorous probability theory with an emphasis on different statistical properties. In addition, we have used five different metaheuristic optimization algorithms for this study: differential evolution (DE), particle swarm optimization (PSO), cuckoo search (CS), the artificial bee colony (ABC) algorithm and the genetic algorithm (GA). There are over 100 different algorithms and variants in the literature (Yang et al., 2018; Eskandar et al., 2012; Zaman et al., 2016), so it is not possible to compare even a good fraction of them. Therefore, the choice of algorithms focuses on different search characteristics and on representative algorithms in the current literature. Differential evolution is a good representative of evolutionary algorithms, while particle swarm optimization is considered the main optimizer among swarm intelligence based algorithms. In addition, cuckoo search uses a long-tailed, Lévy flights-based search mechanism that has been shown to be more efficient in exploring the search space. Furthermore, the artificial bee colony algorithm is used to represent bee-based algorithms, while the genetic algorithm has been a cornerstone for a vast majority of evolutionary algorithms.
Based on the simulations and analyses below, we can highlight the features and contributions of this paper as follows:
Numerical experiments show that, under the same maximum number of fitness evaluations (FEs), some algorithms require a large population size to reach the optimal solution, while others can find the optimal solution through many iterations with a small population size. In this paper, we make some recommendations concerning the initial population size and the maximum number of iterations for the five algorithms.
Initialization with 22 different probability distributions and its influence on the performance of the algorithms is studied systematically. It is found that some algorithms, such as differential evolution, are not significantly affected by initialization, while others, such as particle swarm optimization, are more sensitive to initialization. This may be related to the design mechanisms of the algorithms themselves, and it is also an important indicator of the robustness of algorithms.
For the five algorithms under consideration, we have used a statistical ranking technique, together with a correlation test, to gain insight into the appropriate initialization methods for given benchmark functions.
The rest of this paper is organized as follows. Section 2 briefly introduces the fundamentals of three metaheuristic optimizers with some brief discussions of the other two optimizers, followed by the motivations and details of the initialization methods in Section 3. Experimental results are presented in Section 4, together with the comparison of different initialization methods on some benchmark functions, including commonly used benchmarks and some recent CEC functions. Further experiments concerning key parameters of the different algorithms are also carried out. Then, Section 5 discusses the correlation between the distributions of the initial population and the corresponding final solutions. Finally, Section 6 concludes with discussions of further research directions.
2 Metaheuristic Optimizers
Though traditional optimization algorithms can work well for local search, metaheuristic optimization algorithms have some main advantages for global optimization because they usually treat the problem as a black-box and thus can be flexible and easy to use (Yang, 2014). Furthermore, such optimizers do not have strict mathematical requirements (e.g., differentiability, smoothness), so they are suitable for problems with different properties, including discontinuities and nonlinearity. Various studies have shown their effectiveness in different applications (Yang, 2014; Aljarah et al., 2020; Yin et al., 2019).
The initialization of a vast majority of metaheuristic optimization algorithms has been done using uniform distributions. Although this approach is easy to implement, empirical observations suggest that uniform distributions may not be the best option in all applications. There is thus a clear need to study initialization systematically using different probability distributions. As there are many optimization algorithms, it is not possible to study all of them. Thus, this paper focuses on five algorithms: differential evolution (DE), particle swarm optimization (PSO), cuckoo search (CS), the artificial bee colony (ABC) algorithm and the genetic algorithm (GA). These algorithms are representative, owing to their different search mechanisms and rich characteristics.
2.1 Differential Evolution
Differential evolution (DE) is a representative evolutionary and heuristic algorithm (Storn and Price, 1997). Though differential evolution has a strong global search capability with a relatively high convergence rate for unimodal problems, the performance of DE can depend on its parameter settings. For highly nonlinear problems, its convergence rate can be low. To overcome such limitations, various mutation strategies and adaptive parameter control schemes have been proposed to improve its performance (Zhang and Sanderson, 2009). In the DE algorithm, each individual is a candidate solution or a point in the $D$-dimensional search space, and the $i$-th individual can be represented as $\mathbf{x}_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,D})$. In essence, different mutation strategies typically generate a mutation vector $\mathbf{v}_i$ by modifying the current solution vector in different ways.
Crossover is another strategy for modifying a solution. For example, the binomial crossover is a component-wise modification, controlled by a crossover parameter $C_r$, which takes the following form:
$$u_{i,j} = \begin{cases} v_{i,j}, & \text{if } r_j \le C_r \ \text{or}\ j = j_{\mathrm{rand}}, \\ x_{i,j}, & \text{otherwise}, \end{cases}$$
where $x_{i,j}$ is the $j$-th dimension of the $i$-th individual solution, $r_j$ is a uniform random number, and $j_{\mathrm{rand}}$ is a randomly chosen dimension. The mutant vector after the mutation step can be expressed as $\mathbf{v}_i = (v_{i,1}, \ldots, v_{i,D})$, and $u_{i,j}$ corresponds to the $j$-th dimension of the $i$-th trial individual after crossover.
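As an illustrative sketch (not the paper's exact implementation), the DE/rand/1 mutation and the binomial crossover described above can be written in a few lines of Python/NumPy. The values F = 0.5 and CR = 0.9 are common defaults assumed here, not settings taken from this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def de_rand_1(pop, i, F=0.5):
    """DE/rand/1 mutation: v = x_r1 + F * (x_r2 - x_r3)."""
    n = len(pop)
    # pick three distinct indices, all different from the target index i
    r1, r2, r3 = rng.choice([j for j in range(n) if j != i], size=3, replace=False)
    return pop[r1] + F * (pop[r2] - pop[r3])

def binomial_crossover(x, v, CR=0.9):
    """Component-wise binomial crossover between target x and mutant v."""
    d = len(x)
    mask = rng.random(d) < CR
    mask[rng.integers(d)] = True   # ensure at least one component comes from v
    return np.where(mask, v, x)

pop = rng.uniform(-5, 5, size=(10, 4))   # 10 individuals in 4 dimensions
v = de_rand_1(pop, 0)
u = binomial_crossover(pop[0], v)
```

Each component of the trial vector `u` comes either from the target `pop[0]` or from the mutant `v`, with at least one component guaranteed to come from `v`.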
Among the various variants of DE, Qin et al. (Qin et al., 2009) proposed a self-adaptive DE (SaDE) variant with four mutation strategies in its pool, which can be selected at different generations by a given criterion. More specifically, according to the success and failure of each mutation, a fixed learning period (LP) was used to update the probability of each mutation strategy being selected for the next generation. In addition, the mutation factor $F$ was drawn from a normal distribution with a mean of 0.5 and a standard deviation of 0.3; that is, $F \sim N(0.5, 0.3)$. Similarly, the crossover rate $C_r$ was drawn from a normal distribution $N(C_{rm}, 0.1)$, where the mean $C_{rm}$ was calculated from the previous LP generations. Though the performance of SaDE was good, its complexity had increased.
For ease of implementation and comparison, we use a simplified adaptive DE (DE-a) in this paper, based on the idea of the SaDE algorithm. In its mutation pool, we use five mutation strategies as follows:
DE/rand/1 (Storn and Price, 1997)
DE/current-to-best/1 (Zhang and Sanderson, 2009)
where $F$ is a parameter for mutation strength, and $x_{\mathrm{best},j}$ is the $j$-th dimension of the current best solution. Here, $\mathbf{x}_{r1}$, $\mathbf{x}_{r2}$, $\mathbf{x}_{r3}$, $\mathbf{x}_{r4}$ and $\mathbf{x}_{r5}$ represent five different individuals, which are selected randomly from the current population.
Both parameters $F$ and $C_r$ are initialized from sets of discrete candidate values. The current mutation strategy and parameter settings are kept unchanged if better solutions are found during the iterations. Otherwise, the mutation strategy and parameters are randomly re-selected from the above sets. Our simplified variant is easier to implement, and its performance is much better than that of the original DE, as observed in our simulations later. Therefore, we will use this variant in the later simulations.
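The keep-if-improved, resample-otherwise logic of DE-a can be sketched as below. The candidate pools are hypothetical placeholders for illustration only; they are not the discrete sets used in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical candidate pools -- illustrative only, not the paper's exact sets.
STRATEGIES = ["rand/1", "rand/2", "best/1", "best/2", "current-to-best/1"]
F_POOL = [0.4, 0.5, 0.6, 0.8, 1.0]
CR_POOL = [0.1, 0.3, 0.5, 0.7, 0.9]

def adapt(settings, improved):
    """Keep the current (strategy, F, CR) triple if it produced a better
    solution this generation; otherwise resample all three from the pools."""
    if improved:
        return settings
    return (str(rng.choice(STRATEGIES)),
            float(rng.choice(F_POOL)),
            float(rng.choice(CR_POOL)))

settings = adapt(("rand/1", 0.5, 0.9), improved=False)   # resampled
kept = adapt(("best/1", 0.8, 0.3), improved=True)        # unchanged
```

This makes the adaptation logic explicit: the only feedback signal is whether the last generation improved the best solution.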
2.2 Particle Swarm Optimization
Particle swarm optimization (PSO) is a well-known swarm intelligence optimizer with good convergence (Clerc and Kennedy, 2002), which is widely used in many applications (Kennedy and Eberhart, 2011). However, it can have premature convergence for some problems, and thus various variants have been developed to remedy it with different degrees of improvement. Among different variants, an improved PSO with an inertia weight (PSO-w), proposed by Shi and Eberhart (Shi and Eberhart, 1998), is efficient and its main steps can be summarized as the following update equations:
$$\mathbf{v}_i^{t+1} = w\,\mathbf{v}_i^t + c_1 r_1 (\mathbf{p}_i - \mathbf{x}_i^t) + c_2 r_2 (\mathbf{g} - \mathbf{x}_i^t), \qquad \mathbf{x}_i^{t+1} = \mathbf{x}_i^t + \mathbf{v}_i^{t+1},$$
where $\mathbf{v}_i^t$ and $\mathbf{x}_i^t$ are the velocity vector and position vector, respectively, of particle $i$ at iteration $t$. Here, $\mathbf{p}_i$ is the individual best solution of the $i$-th particle in the previous iterations, and $\mathbf{g}$ is the best solution of the current population. In the velocity-update equation, $c_1$ and $c_2$ are the two learning parameters, while $r_1$ and $r_2$ are two random numbers at the current iteration, drawn from a uniform distribution. In the special case when the inertia weight $w = 1$, this variant becomes the original PSO.
The value of $w$ can affect the convergence rate significantly. If $w$ is large, the algorithm can have a faster convergence rate, but it can easily fall into local optima, leading to premature convergence. Studies have shown that a $w$ dynamically adjusted with the iteration counter can be more effective. That is,
$$w(t) = w_{\max} - (w_{\max} - w_{\min})\,\frac{t}{t_{\max}},$$
where $t_{\max}$ represents the maximum number of iterations, and $w_{\min}$ and $w_{\max}$ are the minimum and maximum inertia weights, respectively. We will use PSO-w in the later experiments.
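One PSO-w step with the linearly decreasing inertia weight can be sketched as follows. The bounds w_min = 0.4 and w_max = 0.9 are common choices in the literature, assumed here rather than prescribed by this paper; c1 = c2 = 1.5 matches the settings used later in the experiments:

```python
import numpy as np

rng = np.random.default_rng(2)

def pso_w_step(x, v, pbest, gbest, t, t_max,
               c1=1.5, c2=1.5, w_min=0.4, w_max=0.9):
    """One PSO-w update: inertia weight decays linearly from w_max to w_min."""
    w = w_max - (w_max - w_min) * t / t_max
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v_new = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v_new, v_new
```

Note that if a particle sits exactly at both its personal best and the global best with zero velocity, the update leaves it in place, which is one reason diversity in the initial population matters.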
2.3 Cuckoo Search
The cuckoo search (CS) algorithm is a metaheuristic algorithm, developed by Xin-She Yang and Suash Deb (Yang and Deb, 2009), based on the behavior of some cuckoo species and their interactions with host species in terms of brood parasitism. CS also uses Lévy flights instead of isotropic random walks, which can explore large search spaces more efficiently. As a result, CS has been applied in many applications such as engineering design (Gandomi et al., 2013), semantic Web service composition (Chifu et al., 2011), thermodynamic calculations (Bhargava et al., 2013) and so on.
Briefly speaking, the CS algorithm consists of two parts: local search and global search. The current individual $\mathbf{x}_i^t$ is modified to a new solution by using the following global random walk:
$$\mathbf{x}_i^{t+1} = \mathbf{x}_i^t + \alpha\, L(s, \beta),$$
where $\alpha$ is a factor controlling step sizes, and $s$ is the step size. $L(s, \beta)$ is a random vector drawn from a Lévy distribution (Yang, 2014). That is,
$$L(s, \beta) \sim \frac{\beta\,\Gamma(\beta)\sin(\pi\beta/2)}{\pi}\,\frac{1}{s^{1+\beta}}, \qquad s \gg 0.$$
Here, '$\sim$' means that $L$ is drawn as a random number from the distribution on the right-hand side of the equation. $\Gamma$ is the Gamma function, while $\beta$ is a parameter. One of the advantages of using Lévy flights is that there is a small probability of long jumps, which enables the algorithm to escape from local optima and thus increases its exploration capability (Yang et al., 2018; Viswanathan et al., 1999). The local search is mainly carried out by
$$\mathbf{x}_i^{t+1} = \mathbf{x}_i^t + \alpha s\, H(p_a - \epsilon)\,(\mathbf{x}_j^t - \mathbf{x}_k^t),$$
where $H(\cdot)$ is the Heaviside function. This equation modifies the solution using two other solutions $\mathbf{x}_j^t$ and $\mathbf{x}_k^t$. Here, the random number $\epsilon$ is drawn from a uniform distribution and $s$ is the step size. A switching probability $p_a$ is used to switch between these two search mechanisms, with the intention of balancing global search and local search.
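In practice, Lévy-distributed steps are often generated with Mantegna's algorithm (described in Yang, 2014). The sketch below assumes β = 1.5 and a step-scaling factor α = 0.01, which are common defaults rather than values fixed by this paper:

```python
import numpy as np
from math import gamma, pi, sin

rng = np.random.default_rng(3)

def levy_step(dim, beta=1.5):
    """Draw a Levy-stable step vector using Mantegna's algorithm."""
    sigma_u = (gamma(1 + beta) * sin(pi * beta / 2)
               / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma_u, dim)
    v = rng.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / beta)   # heavy-tailed: occasional long jumps

def cs_global_walk(x, best, alpha=0.01):
    """Global random walk of cuckoo search, scaled around the current best."""
    return x + alpha * levy_step(len(x)) * (x - best)

step = levy_step(5)
```

The occasional very large components of `step` are exactly the long jumps that give CS its strong exploration capability.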
2.4 Other Optimizers
There are other optimizers that can be representative for the purpose of comparison. The genetic algorithm (GA) has been a cornerstone of almost all modern evolutionary algorithms, which consists of crossover, mutation and selection mechanisms. The GA has a wide range of applications such as pattern recognition (Pal and Wang, 2017), neural networks and control system optimization (Back and Schwefel, 1996) as well as discrete optimization problems (Guerrero et al., 2017). The literature on this algorithm is vast, thus we will not introduce it in detail here.
On the other hand, the artificial bee colony (ABC) algorithm was inspired by foraging behaviour of honey bees (Karaboga, 2005), and this algorithm has been applied in many applications (Li et al., 2017; Gao et al., 2018, 2019). A multi-objective version also exists (Xiang et al., 2015). Due to the page limit, we will not introduce this algorithm in detail. Readers can refer to the relevant literature (Karaboga and Basturk, 2007).
We will use the above five algorithms in this paper for different initialization strategies.
3 Initialization Methods
The main objective of this paper is to investigate different probability distributions for initialization and their effects on the performance of the algorithms used.
3.1 Motivations of this work
Both existing studies and empirical observations suggest that initialization can play an important role in the convergence speed and accuracy of certain algorithms. A good set of initial solutions, especially one that happens to lie near the true optimum, can reduce the search effort and thus increase the probability of finding the true optimum. As the location of the true optimum is unknown in advance, initialization is usually carried out uniformly, in a similar manner to Monte Carlo simulations. However, for problems in higher dimensions, a small initial population may be biased and could lie sparsely in unpromising regions. In addition, the diversity of the initial population is also important, and different distributions have different sampling emphases, leading to different degrees of diversity. For example, some studies concerning genetic algorithms have shown some effects of initialization (Burke et al., 2004; Chou and Chen, 2000).
Many initialization methods in the literature, such as Latin hypercube sampling (LHS), are mainly based on the idea of spreading points uniformly in the search space. They are easy to implement and can work well in some cases. For example, the two-dimensional landscape of the Bukin function is shown in Fig. 1. Within the given search domain, the PSO-w algorithm with an initial population obeying a uniform distribution can find the optimal solution in a few iterations. The distribution of the particles is shown in Fig. 2. For comparison, another run with an initial beta distribution has also been carried out, as shown in Fig. 3. In these figures, one marker indicates the real optimal solution at (-10, 1), the dots show the locations of the current population, and another marker indicates the best solution in the current population. Fig. 2 shows the initial population drawn from a uniform distribution in the search domain; this population converged near the optimal solution after 5 iterations of the PSO-w algorithm, with the current best solution of the population close to the real optimal solution. However, the initial population drawn from a beta distribution (as shown in Fig. 3) fell into a local optimum after 5 iterations. This clearly shows the effect and importance of initialization.
For the above function, initialization with a uniform distribution seems to give better results. However, for another function, uniform distributions may give worse results, even though they are widely used. As an illustrative example, the best solution of the two-dimensional Michalewicz function is at [2.20319, 1.57049] (see Fig. 4). If the initialization is done with a uniform distribution, it can lead to premature convergence, as shown in Fig. 5, while initialization with a beta distribution can lead to the global optimal solution after 5 iterations, as shown in Fig. 6. Clearly, this shows that uniform distributions are not the best initialization method for all functions. For the same algorithm (such as PSO-w), different initialization methods can lead to different accuracies on different problems. This suggests that different initialization methods should be used for different problems. We will investigate this issue further in a more systematic way.
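For reference, a minimal implementation of the Michalewicz function (assuming the usual steepness parameter m = 10) confirms the stated optimum location:

```python
import numpy as np

def michalewicz(x, m=10):
    """Michalewicz function: steep, narrow valleys around each minimum
    make the result sensitive to where the search starts."""
    x = np.asarray(x, dtype=float)
    i = np.arange(1, len(x) + 1)
    return -np.sum(np.sin(x) * np.sin(i * x**2 / np.pi) ** (2 * m))

f_star = michalewicz([2.20319, 1.57049])   # close to the known 2-D minimum
```

Because most of the domain is nearly flat (value close to zero), an initial population concentrated away from the valleys gives the optimizer almost no gradient of fitness to follow.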
In order to study the effect of initialization systematically, we will use a diverse range of different initialization methods such as Latin hypercube sampling and different probability distributions. We now briefly outline them in the rest of this section.
3.2 Details of initialization methods
Before we carry out detailed simulations, we now briefly outline the main initialization methods.
3.2.1 Latin hypercube sampling
Latin hypercube sampling (LHS) is a space-filling mechanism. It creates a grid in the search space by dividing each dimension into equally sized intervals, and then generates one random point within each selected cell. It uses ancillary variables to ensure that each of the variables to be represented is fully stratified in the feature space (McKay et al., 1979). For example, if three sample points are needed in a two-dimensional (2D) parameter space, the three points may have four location scenarios (shown in Fig. 7). Obviously, these three points can also be scattered in the diagonal subspace of the 2D search space.
In LHS, a set of samples is spread out in the search space, which effectively avoids the over-aggregation of sampling points. Studies show that such sampling can provide a better spread than uniform random sampling, but it does not show a distinct advantage for higher-dimensional problems. Thus, we will investigate this issue further.
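A minimal sketch of LHS for a box-constrained search space, illustrating the one-point-per-stratum property:

```python
import numpy as np

rng = np.random.default_rng(4)

def latin_hypercube(n, dim, lower, upper):
    """n LHS samples in [lower, upper]^dim: each dimension is cut into n
    equal strata, one point is drawn per stratum, and the strata are then
    shuffled independently per dimension."""
    samples = np.empty((n, dim))
    for d in range(dim):
        # one uniform draw inside each of the n strata, then shuffle
        edges = (np.arange(n) + rng.random(n)) / n
        samples[:, d] = rng.permutation(edges)
    return lower + samples * (upper - lower)

pts = latin_hypercube(8, 2, -5.0, 5.0)
```

By construction, projecting the samples onto any single axis gives exactly one point per stratum, which is what prevents the over-aggregation mentioned above.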
3.2.2 Beta distribution
A beta distribution is a continuous probability distribution over the interval (0, 1). Its probability density function (PDF) is given by
$$f(x; a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, x^{a-1} (1-x)^{b-1}, \qquad 0 < x < 1,$$
where $\Gamma$ is the standard Gamma function. This distribution has two shape parameters ($a$ and $b$) that essentially control the shape of the distribution, and it is usually written as $\mathrm{Beta}(a, b)$. Its expected value is $a/(a+b)$, and its variance is $ab/[(a+b)^2(a+b+1)]$.
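Since a beta variate lives on (0, 1), using it for initialization requires rescaling to the search box. A short sketch follows; the shape values a = b = 2 (a symmetric, center-heavy shape) are illustrative, not a recommendation from this paper:

```python
import numpy as np

rng = np.random.default_rng(5)

def beta_init(n, dim, lower, upper, a=2.0, b=2.0):
    """Initial population of n individuals drawn from Beta(a, b) on (0, 1),
    linearly rescaled to the search box [lower, upper]^dim."""
    return lower + rng.beta(a, b, size=(n, dim)) * (upper - lower)

pop = beta_init(100, 10, -5.0, 5.0)
```

Varying a and b shifts the sampling emphasis: a = b > 1 concentrates points near the center of the box, while a, b < 1 pushes them toward the boundaries.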
3.2.3 Uniform distribution
Uniform distributions are widely used in initialization. A uniform distribution on an interval $[a, b]$ is given by
$$f(x) = \frac{1}{b - a}, \qquad a \le x \le b,$$
where $a$ and $b$ are the limits of the interval. Its expectation or mean is $(a+b)/2$, and its variance is $(b-a)^2/12$.
3.2.4 Normal distribution
Gaussian normal distributions are among the most widely used distributions in various applications, though they are not usually used in initialization. The probability density function of this bell-shaped distribution can be written as
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right],$$
with mean $\mu$ and standard deviation $\sigma$. This distribution is often written as $N(\mu, \sigma^2)$, where the mean $\mu$ determines the central location of the probability curve and the standard deviation $\sigma$ determines the spread on both sides of the mean (Yang, 2014; Kızılersü et al., 2018). Normal distributions can be approximated by other distributions and are closely linked with other distributions such as the log-normal distribution, Student's $t$-distribution and the $\chi^2$-distribution.
3.2.5 Logarithmic normal distribution
Unlike the normal distribution, the log-normal distribution is an asymmetric distribution. Its probability density function is
$$f(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\!\left[-\frac{(\ln x - \mu)^2}{2\sigma^2}\right], \qquad x > 0.$$
3.2.6 Exponential distribution
An exponential distribution is asymmetric with a long tail, and its probability density function can be written as
$$f(x) = \lambda e^{-\lambda x}, \qquad x \ge 0,$$
where $\lambda > 0$ is a rate parameter. Its mean and standard deviation are both $1/\lambda$.
3.2.7 Rayleigh distribution
The probability density function of the Rayleigh distribution can be written as
$$f(x) = \frac{x}{\sigma^2} \exp\!\left(-\frac{x^2}{2\sigma^2}\right), \qquad x \ge 0,$$
whose mean and variance are $\sigma\sqrt{\pi/2}$ and $\frac{4-\pi}{2}\sigma^2$, respectively (Weik and Weik, 2001).
3.2.8 Weibull distribution
The Weibull distribution has the probability density function (Kızılersü et al., 2018)
$$f(x) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^k}, \qquad x \ge 0,$$
where $\lambda > 0$ is a scale parameter and $k > 0$ is a shape parameter. This distribution can be considered a generalization of a few other distributions. For example, $k = 1$ corresponds to an exponential distribution, while $k = 2$ leads to the Rayleigh distribution. Its mean and variance are $\lambda\,\Gamma(1 + 1/k)$ and $\lambda^2\!\left[\Gamma(1 + 2/k) - \Gamma^2(1 + 1/k)\right]$, respectively.
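These special cases can be checked numerically. The sketch below compares sample means of Weibull draws (scale λ = 2, an arbitrary illustrative value) against the closed-form exponential and Rayleigh means:

```python
import numpy as np

rng = np.random.default_rng(6)

# NumPy's rng.weibull(k) samples the standard (scale-1) Weibull;
# multiplying by lam gives scale parameter lam.
lam, n = 2.0, 200_000
w1 = lam * rng.weibull(1.0, n)   # k = 1: exponential with mean lam
w2 = lam * rng.weibull(2.0, n)   # k = 2: Rayleigh with sigma = lam / sqrt(2)

exp_mean = lam                                        # exponential mean
ray_mean = (lam / np.sqrt(2)) * np.sqrt(np.pi / 2)    # Rayleigh mean sigma*sqrt(pi/2)
```

The sample means agree with the closed-form values to within sampling error, confirming the reductions stated above.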
Based on the above different probability distributions, we will carry out various numerical experiments in the rest of this paper.
4 Numerical Experiments
4.1 Experimental settings
In order to investigate the possible influence of different initialization methods on the five algorithms (PSO-w, DE-a, CS, ABC, GA), a series of experiments has first been carried out using a set of nine benchmark functions, as shown in Table 1. The experiments will focus first on the PSO-w, DE-a and CS, and then similar tests will be carried out for the ABC and GA. These benchmark functions are chosen based on their different properties, such as their modal shapes and numbers of local optima. More specifically, some of the nine functions are continuous, unimodal functions, while the others are multimodal. For example, the global minimum of the Rosenbrock function lies in a narrow, parabolic valley, which can be difficult for many traditional algorithms. Several of the multimodal functions have many widespread local minima. The bowl-shaped function has local minima but only one global optimum, while the Easom function has several local minima, and its global minimum lies in a small area of a relatively large search space. In addition, we will use 10 more recent benchmarks from CEC2014 and CEC2017, to be discussed in detail later.
For a fair comparison, we have set the same termination condition for all the algorithms, with a maximum number of function evaluations (FEs) of 600,000, and each algorithm with a given initialization is run 20 times independently. For all the test functions, the same dimensionality is used. As many sets of data are generated, we have summarized the results as the 'Best', 'Mean', 'Var' (variance) and 'Dist'. Here, 'Dist' corresponds to the mean distance from the obtained solutions to the true global optimal solution $\mathbf{x}^*$. That is,
$$\mathrm{Dist} = \frac{1}{N} \sum_{k=1}^{N} \left\lVert \mathbf{x}_k - \mathbf{x}^* \right\rVert,$$
where $N$ denotes the total number of runs in each set of experiments and $\mathbf{x}_k$ is the final solution of the $k$-th run. This distance metric not only measures how close the results are to the optimum, but also reflects the stability of the obtained solutions.
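The 'Dist' metric can be computed directly from the final solutions of the N runs; a small sketch with made-up solution vectors:

```python
import numpy as np

def mean_distance(solutions, x_star):
    """'Dist' metric: average Euclidean distance from each run's final
    solution to the known global optimum x_star, over N independent runs."""
    solutions = np.asarray(solutions, dtype=float)
    return float(np.mean(np.linalg.norm(solutions - x_star, axis=1)))

# e.g. three runs on a 2-D problem whose optimum is the origin
dist = mean_distance([[0.1, 0.0], [0.0, 0.2], [0.3, 0.4]], np.zeros(2))
```

A small 'Dist' with a small 'Var' indicates that the runs land consistently close to the optimum, not merely that one lucky run did.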
For the algorithm-dependent parameters, after some preliminary parametric studies, we have fixed the mutation strength $F$ and the crossover parameter $C_r$ for DE-a. In the PSO-w, the learning factors $c_1$ and $c_2$ are set to 1.5, and the inertia weight $w$ varies dynamically as described in Section 2.2. The parameters of the CS were fixed in the same way after the preliminary studies. In addition, the population size will be varied so as to see whether it has any effect on the results.
4.2 Influence of population size and number of iterations
Before we can compare different initialization methods in detail, we have to determine whether there is any significant effect due to the population size ($P$) and the maximum number of iterations ($t_{\max}$). Many studies in the existing literature used different population sizes and numbers of iterations (Akay and Karaboga, 2012). Though the total number of function evaluations for all functions and algorithms is set to 600,000, the maximum number of iterations varies with $P$. Obviously, a larger $P$ leads to a smaller $t_{\max}$.
In order to make a fair comparison, all the algorithms are initialized with the same random initialization. Four of the benchmark functions are selected randomly to reduce the computational effort. We have carried out numerical experiments, and the results are summarized in Tables 2 to 4.
Table 2 shows the experimental results of the DE-a algorithm with different $P$ and $t_{\max}$. With a small population size and a large number of iterations, DE-a shows better performance in most cases. That means the accuracy of the DE-a algorithm depends more heavily on the number of iterations, and it manages to find the optimal solution with a small population size.
Table 3 summarizes the results for the PSO-w algorithm. We can see that the PSO-w algorithm performs well on the Rosenbrock, Rastrigin and Griewank functions when the population size is 3000 and the number of iterations is 200. Only for the Sphere function does the PSO-w reach its highest search accuracy with a different setting. These results suggest that the accuracy of the PSO-w may depend more on its population size.
Table 4 shows that the CS algorithm performs better with a small population and many iterations. Compared with DE, CS can find the optimal solution with an even smaller population size. This may be related to the design mechanism of the CS algorithm, which increases the diversity during the iteration process. This is one of the advantages of the CS algorithm.
Based on the above experiments, it is recommended that the population size and the number of maximum iterations be set as shown in Table 5. Thus, these parameter settings will be used in all the subsequent experiments.
4.3 Numerical results
In order to compare the possible effects of different initialization strategies on the first three algorithms (PSO-w, DE-a and CS), 22 different initialization methods have been tested, including 9 different distributions with various distribution parameters. As before, we have used the benchmark functions with the same dimensionality and have run each algorithm independently 20 times. Tables 6, 7 and 8 show the comparison of the 'Best', 'Mean', 'Var' and 'Dist' values obtained by the three algorithms.