1.1 Motivation
The field of AI inevitably emerged as computers found their way into many applications. Engineers and computer scientists who developed AI models began using them to solve many of the problems they faced. The success of AI applications in computer science and computer engineering was exciting: it became possible to infer the meaning of a word from its context, computers and machines became able to understand spoken language to some extent, and they have recently been used to match fingerprints. Until recently, most AI models were developed primarily to solve problems in computer engineering and computer science. These models were later used as-is, or slightly modified, by researchers in other fields.
Control engineers and researchers were enthusiastic about the results obtained by their fellows in the computer science field. They scrutinized these AI models because, back in their labs, they faced complex problems unyielding to traditional mathematical techniques. Among these is the system identification problem, which, with its hard nonlinearity, multimodality, and constraints, is especially unsuitable for traditional mathematical techniques; the results obtained using these techniques are unsatisfactory for most real-life applications. Hence, control engineers started using the models developed by their fellows to solve system identification problems. The results obtained were encouraging, and many AI models became the method of choice for many control engineers.
The author of this thesis belongs to both groups of researchers. He was unsatisfied with the off-the-shelf AI models used by control engineers, so he used his knowledge in both fields to test, modify, and develop new models with the problem of system identification in mind.
A new wave of AI models is the CI techniques, which are, in most cases, nature-inspired or biologically inspired. Recent research has shown promising results in their application to many control engineering problems, and specifically to system identification problems [1]. Due to their inherent capability of handling many of the difficulties encountered in control engineering problems, and because of the encouraging results reported by many researchers, the author of this thesis found that developing these techniques is the next logical step to pursue.
1.2 Thesis Outline
After this brief introduction, the optimization problem with its various types is presented in Chapter 2. After single- and multiobjective problems are presented and their terminology and definitions are explained in Sections 2.1 and 2.2, respectively, the difficulties faced in solving these problems are detailed in Section 2.3.
The thesis moves to nontraditional techniques by presenting Evolutionary Algorithms (EAs) in Chapter 3. After an introduction in Section 3.1, the logic behind Evolutionary Algorithms and an analysis of their behavior are detailed in Section 3.2, and an example of these techniques, the GA, is presented in detail in Section 3.3. Finally, a proposed extension to the GA model is explained, tested on a set of benchmark problems, and a conclusion about its efficiency is given in Section 3.4.
More CI models follow in Chapter 4, where Swarm Intelligence (SI) methods are presented. The chapter starts with an introduction, then explains some theoretical aspects of these techniques in Section 4.2. An example of the SI techniques, the PSO, is presented, followed by some of its variations, in Sections 4.3 and 4.4, respectively. A proposed modification to some aspects of the PSO model is explained, analyzed, and tested on many test problems, with the results furnished and followed by a conclusion in Section 4.5. Finally, a comparison between the GA and PSO models concludes the chapter.
Chapter 5 is concerned with the applications of CI techniques in control engineering, and particularly in system identification. It starts by outlining the advantages and disadvantages of using CI methods in control engineering in Sections 5.1 and 5.2, respectively. Different applications in control engineering are presented in Section 5.3, and the problem of system identification is treated in more detail in Section 5.4. The problem of parameter identification of an induction motor is explained, its model is derived, the algorithms used to solve it are presented, and the experiment is carried out with its results explained and a conclusion furnished in Section 5.5.
Chapter 6 concludes this thesis by analyzing the results reached in previous chapters and proposes future research directions.
2.1 Single-objective Optimization Problems
Single-objective Optimization Problems, as their name implies, are problems that have a single objective to be optimized (maximized or minimized) by varying their parameters. A SOP can be defined as follows [3, 4]:
Definition 1 (Singleobjective Optimization Problem).
optimize $y = f(\mathbf{x})$  (2.1)
subject to $g(\mathbf{x}) \le 0$  (2.2)
$h(\mathbf{x}) = 0$  (2.3)
$\mathbf{x} = (x_1, x_2, \ldots, x_n) \in X, \quad y \in Y$  (2.4)
where $x_i$ is a decision variable, $\mathbf{x}$ is a decision vector, $X$ is the decision space, $f$ is the objective function, $Y$ is the objective space, and $h$ and $g$ are the equality and inequality constraint functions, respectively.
The constraint functions determine the feasible set.
Definition 2 (Feasible Set).
The feasible set $X_f$ is the set of decision vectors $\mathbf{x} \in X$ that satisfy the constraints $g(\mathbf{x}) \le 0$ and $h(\mathbf{x}) = 0$:
$X_f = \{\mathbf{x} \in X \mid g(\mathbf{x}) \le 0,\ h(\mathbf{x}) = 0\}$
The image of the feasible set in the objective space, $Y_f = f(X_f)$, is known as the feasible region.
A well-known SOP is the Rastrigin test problem [5, 6]. It is defined as follows:
minimize $f(\mathbf{x}) = 10n + \sum_{i=1}^{n} \left( x_i^2 - 10\cos(2\pi x_i) \right)$  (2.5)
s.t. $-5.12 \le x_i \le 5.12, \quad i = 1, 2, \ldots, n$  (2.6)
where $n$ is the number of decision variables. Figure 2.2 shows the landscape of this problem for $n = 2$. The minimum value of the objective function in the feasible region is achieved at the global minimum, or more generally, the global optimum.
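As an illustration, the Rastrigin function above can be written as a short Python sketch (the variable bounds of Equation (2.6) are not enforced here):

```python
import math

def rastrigin(x):
    """Rastrigin function: 10*n + sum(x_i^2 - 10*cos(2*pi*x_i))."""
    n = len(x)
    return 10 * n + sum(xi ** 2 - 10 * math.cos(2 * math.pi * xi) for xi in x)

# At the origin every term vanishes, so the function value is zero.
print(rastrigin([0.0, 0.0]))  # 0.0
```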
Definition 3 (Global Optimum).
A global optimum is a point in the feasible region whose value is better than those of all other points in that region.
For the Rastrigin problem shown in Figure 2.2, the global minimum is achieved at $\mathbf{x} = (0, 0)$ with a value of $f(\mathbf{x}) = 0$. Excluding the valley at whose bottom the global minimum lies, there are 12 valleys in this problem's landscape. The point at the bottom of each of them is known as a local minimum, or more generally, a local optimum.
Definition 4 (Local Optimum).
A local optimum is a point in the feasible region whose value is better than those of all other points in its vicinity, and worse than the global optimum.
2.2 Multiobjective Optimization Problems
Unlike SOPs, MOPs have many objectives to be optimized concurrently, and most of the time these objectives are conflicting. A MOP can be defined as follows [3, 4]:
Definition 5 (Multiobjective Optimization Problem).
optimize $\mathbf{y} = f(\mathbf{x}) = (f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_k(\mathbf{x}))$  (2.7)
subject to $g(\mathbf{x}) \le 0, \quad h(\mathbf{x}) = 0$  (2.8)
$\mathbf{x} = (x_1, x_2, \ldots, x_n) \in X, \quad \mathbf{y} = (y_1, y_2, \ldots, y_k) \in Y$  (2.9)
where $x_i$ is a decision variable, $\mathbf{x}$ is a decision vector, $X$ is the decision space, $\mathbf{y}$ is an objective vector, $f$ is the vector of objective functions, $Y$ is the objective space, and $h$ and $g$ are the equality and inequality constraint functions, respectively.
minimize  (2.10)  
(2.11)  
(2.12) 
Figure 2.4 shows the values of the objective functions $f_1$ and $f_2$ as the decision variable $x$ varies. In the first range, $f_1$ and $f_2$ are monotonically decreasing with $x$, so the two objectives are in harmony [9], which means that an improvement in one of them is rewarded with a simultaneous improvement in the other. The higher the value of $x$ in this range, the better (the lower) the values of the two objectives become. A similar situation happens in the last range: the two objective functions are again in harmony, and the smallest possible value of $x$ in that range translates to the best (the lowest) values for the two objectives there. However, the two objectives are in conflict in the range in between; an increase in $x$ is accompanied by an improvement of one objective and a deterioration of the other.
A mapping of Figure 2.4 to the objective space gives Figure 2.4. The three regions in Figure 2.4 are mapped to the upper-left dashed segment, the solid segment, and the lower-right dashed segment in Figure 2.4, respectively. It is clear that points d, e and f are in the harmony regions, while points a, b and c are in the conflict region.
Using logical comparison, the following relationships are established: objective vector b is better than objective vector d because, although they have the same value in one objective, b has a lower value in the other. Objective vector c is better than objective vector d as well; it has lower values in both objectives. The following definition puts these relationships in mathematical notation [4].
Definition 6.
For any two objective vectors $\mathbf{u}$ and $\mathbf{v}$,
$\mathbf{u} = \mathbf{v}$ iff $u_i = v_i \quad \forall i \in \{1, \ldots, k\}$  (2.13)
$\mathbf{u} \le \mathbf{v}$ iff $u_i \le v_i \quad \forall i \in \{1, \ldots, k\}$  (2.14)
$\mathbf{u} < \mathbf{v}$ iff $\mathbf{u} \le \mathbf{v} \ \wedge\ \mathbf{u} \ne \mathbf{v}$  (2.15)
In a minimization SOP, a solution r is better than a solution s iff $f(r) < f(s)$, where $f$ is the objective function. But in a MOP this comparison mechanism does not hold, because there is more than one objective to be concurrently optimized. In Figure 2.4, two points may have different objective vectors while neither is better than the other. This may seem illogical when using SOP reasoning, but in MOPs this situation is quite common: two points represent two different solutions, yet neither of them is superior to the other, because each has a lower value than the other in one of the objectives. This means that a new relationship is needed to compare two different decision vectors a and b in a MOP when $f(\mathbf{a}) \ne f(\mathbf{b})$. This relationship is described using Pareto Dominance [4, 10, 11].
Definition 7 (Pareto Dominance).
For any two decision vectors $\mathbf{a}$ and $\mathbf{b}$ in a minimization problem, without loss of generality,
$\mathbf{a} \succ \mathbf{b}$ ($\mathbf{a}$ dominates $\mathbf{b}$) iff $f(\mathbf{a}) < f(\mathbf{b})$  (2.16)
$\mathbf{a} \succeq \mathbf{b}$ ($\mathbf{a}$ weakly dominates $\mathbf{b}$) iff $f(\mathbf{a}) \le f(\mathbf{b})$  (2.17)
$\mathbf{a} \sim \mathbf{b}$ ($\mathbf{a}$ is indifferent to $\mathbf{b}$) iff $f(\mathbf{a}) \nleq f(\mathbf{b}) \ \wedge\ f(\mathbf{b}) \nleq f(\mathbf{a})$  (2.18)
Pareto Dominance is attributed to the Italian sociologist, economist and philosopher Vilfredo Pareto (1848–1923) [12]. It is used to compare the partially ordered solutions of MOPs, as opposed to the completely ordered solutions of SOPs. Using Pareto dominance to compare the solutions represented in Figure 2.4, the following relationships are established: b dominates d because they have the same value in one objective and b has a lower value in the other, while b is indifferent to a because each of them has a lower value than the other in one of the two objectives. The solutions represented by a, b and c are known as Pareto Optimal solutions. These solutions are optimal in the sense that none of their objectives can be improved without simultaneously degrading another objective.
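A minimal Python sketch of these relations, assuming minimization and objective vectors represented as tuples (the function names are illustrative, not from the thesis):

```python
def dominates(u, v):
    """u dominates v: no worse in every objective, strictly better in at least one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def indifferent(u, v):
    """Neither objective vector dominates the other."""
    return not dominates(u, v) and not dominates(v, u)

print(dominates((2, 4), (2, 6)))    # True: equal first objective, lower second
print(indifferent((1, 9), (8, 8)))  # True: each is lower in one objective
```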
Definition 8 (Pareto Optimality).
A decision vector $\mathbf{x} \in A$ is said to be nondominated regarding a set $A \subseteq X_f$ iff
$\nexists\, \mathbf{a} \in A : \mathbf{a} \succ \mathbf{x}$  (2.19)
If it is clear within the context which set A is meant, it is simply left out. Moreover, $\mathbf{x}$ is said to be Pareto optimal iff $\mathbf{x}$ is nondominated regarding $X_f$.
By applying this definition to the example presented in Figure 2.4, it is clear that the solutions represented by d, e and f are dominated, while a, b and c are nondominated. The set of all Pareto-optimal solutions is known as the Pareto-optimal set, and its image in the objective space is known as the Pareto-optimal Front (PF).
Definition 9 (Nondominated Sets and Fronts).
Let $A \subseteq X_f$. The function $p(A)$ gives the set of nondominated decision vectors in $A$:
$p(A) = \{\mathbf{a} \in A \mid \nexists\, \mathbf{b} \in A : \mathbf{b} \succ \mathbf{a}\}$  (2.20)
The set $p(A)$ is the nondominated set regarding $A$; the corresponding set of objective vectors $f(p(A))$ is the nondominated front regarding $A$. Furthermore, the set $X_p = p(X_f)$ is called the Pareto-optimal set and the set $Y_p = f(X_p)$ is denoted as the Pareto-optimal front.
The Pareto-optimal set of the example given in Equations (2.10)–(2.12) has, as its image in the objective space, the PF shown as a solid segment in Figure 2.4.
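Definition 9 translates directly into code. The sketch below applies the function p to a finite set of objective vectors for simplicity (minimization assumed; in the definition p operates on decision vectors):

```python
def p(A):
    """Return the nondominated members of A under minimization."""
    def dominates(u, v):
        return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))
    # Keep every member that no other member dominates.
    return [a for a in A if not any(dominates(b, a) for b in A if b != a)]

front = p([(1, 3), (2, 2), (3, 1), (2, 3), (3, 3)])
print(front)  # [(1, 3), (2, 2), (3, 1)]
```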
To further explain the Pareto dominance relationships, the example in Figure 2.5 is given, where the two objectives $f_1$ and $f_2$ are to be minimized. The relationships in this example show how the solution vector represented by the point at position (8,8) is seen by the other solution vectors. The objective space is divided by a vertical line and a horizontal line passing through this point into four regions.
- All the points in the first region, to the northeast of the point at (8,8) (including the borderlines), have higher $f_1$ or $f_2$ values, or both, so the point at (8,8) is better than them, and these points are dominated by it.
- All the points in the second and fourth regions, to the southeast and northwest of the point at (8,8) (excluding the borderlines), have a lower value in one objective and a higher value in the other compared to the objective values of the point at (8,8). So all the points in these two regions are indifferent to it. Note that although the point at position (1,9) is on the Pareto front while the point at (8,8) is far from that front, they are indifferent to each other.
- All the points in the third region, to the southwest of the point at (8,8) (including the borderlines), have lower $f_1$ or $f_2$ values, or both, which means that they are better than the point at (8,8) and they dominate it.
2.3 Difficulties in Optimization Problems
A researcher or a problem solver could easily be overwhelmed by the great number of algorithms found in the literature dealing with optimization. Some of them date back to 1600 BC [13], while others are being developed as this text is being typed. These myriad algorithms were developed to deal with different types of difficulties in optimization problems. The performance of each of these optimizers depends on the characteristics of the problem it optimizes, such as being linear/nonlinear, static/dynamic, SOP/MOP, combinatorial/discrete/continuous, the types and number of constraints, the size of the search space, etc. According to the No Free Lunch (NFL) theorems [14], all algorithms perform exactly the same when averaged over all possible problems. So, as much knowledge about the problem as possible should be incorporated in selecting the problem optimizer because, according to the NFL theorems, there is no algorithm that performs better than all other algorithms on all possible problems. The first step in understanding optimization problems' characteristics is to answer the question: Why are some problems difficult to solve? [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
2.3.1 The Search Space
The search space of a problem is the set of all possible solutions to that problem. To tune a radio set to a station, a reasonable man may carry out an exhaustive search by scanning the entire available bandwidth of his set, but this reasonable man will never resort to exhaustive search to find the best medication for his heart, or to look up a word in a dictionary. For the medication case, the penalty of trying out all possibilities is too high; it may lead to certain death. Exhaustively looking up a word in a dictionary, on the other hand, takes a lot of time (the Oxford English Dictionary contains over half a million words). To imagine how exhaustive search can easily become a laborious task, the following example is given.
A classic example, and one of the most used combinatorial optimization problems in the AI literature, is the Boolean Satisfiability Problem (SAT) [26]. The problem is to make a compound statement of boolean variables evaluate to TRUE [15]. For example, find the truth assignment for the variables that makes the following function evaluate to TRUE:
(2.21)
where $\bar{x}_i$ is the complement of $x_i$. In real-world situations, such a function with 100 variables is a reasonable one, yet its search space is extremely huge; there are $2^{100}$ possible solutions to this problem. Given a computer that can test 1000 solutions of this problem per second, and assuming it had started its trials at the beginning of time itself, 15 billion years ago, it would have examined less than one percent of all the possibilities by now [15].
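The arithmetic behind this claim can be checked directly; the testing rate and the age of the universe are the figures assumed in the text above:

```python
search_space = 2 ** 100             # truth assignments for 100 boolean variables
rate = 1000                         # solutions tested per second (assumed)
seconds = 15e9 * 365 * 24 * 3600    # roughly 15 billion years, in seconds
tested = rate * seconds

# Fraction of the search space examined: a vanishingly small number.
print(tested / search_space)
```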
The SAT problem is a combinatorial optimization problem, which means that its search space contains a finite set of solutions, and a problem solver can theoretically test all of them. But the Rastrigin problem given earlier in this chapter has a continuous search space, which means there are infinitely many possible solutions; there is no way to test them all.
Because most people are not willing to wait for a computer to exhaustively search for a solution to the given SAT problem, other search techniques have been devised to facilitate the search task. For the radio tuning example given earlier, one can make use of the sound feedback from the radio to narrow down the search space and fine-tune the set. But for the given SAT problem this mechanism is useless, because the feedback, which is the value of the function, is always FALSE except for the single solution resulting in a TRUE output. Such problems are known as needle-in-a-haystack problems [17], as shown in Figure 2.6, where the function equals zero over the entire range except at the single solution corresponding to the global optimum.
The landscape of the problem could be even more difficult than the needle-in-a-haystack case: it could be a misleading, or deceptive, landscape [27, 28, 29, 30]. In this situation the feedback (fitness value) drives the problem solver away from the global minimum, as shown in Figure 2.6. While searching this landscape for the global minimum, moving in either direction away from the global minimum is rewarded by a lower fitness value, and this reward encourages the problem solver to move away from the global minimum.
Another type of difficulty regarding the search space is landscape multimodality [31, 29, 32], or ruggedness [33, 34]. A rugged landscape may trap the problem solver in a local minimum, as shown in Figure 2.6. For this minimization problem, a problem solver can easily get trapped in any of the local minima, and the odds are high that it will fall into another one if it manages to escape the first.
2.3.2 Modeling the Problem
The first step in solving an optimization problem is building a model of it. After this step, the real problem is put aside and the problem solver becomes concerned with the model, not the problem itself. So a solution to an optimization problem is a solution to its model [15], and an optimal solution to an inaccurate model is a right solution to the wrong problem. Inaccuracies arise from wrong assumptions and simplifications.
For example, suppose a major airline company is deciding on its carriers' destinations, flight schedules and pricing policy. In this complicated task, the company may assume that customers in Brazil and Argentina are willing to pay the same price for the same service because the two countries have similar average national incomes. But this assumption is wrong, because customers in Argentina enjoy high-quality flights at a competitive price on their national airline. After the company has considered all the factors needed to build a model of its problem, it is highly likely that this model will be too complicated to be solved by most optimization tools. So the company is faced with one of two possibilities [15]:
- Find a precise solution to a simplified version of the model.
- Find an approximate solution to the precise model.
The first method uses traditional optimization techniques, such as linear or dynamic programming, to find a solution to the approximate model, while the second method uses nontraditional techniques, such as EAs, PSO and Ant Colony Optimization (ACO), to find an approximate solution to the precise model.
2.3.3 Constraints
Another source of difficulty in optimization problems is the constraints imposed on the problem. At first glance, these constraints may seem to aid the problem solver because they do limit the search space. But in many cases they become a major source of headache and make it hard to find a single feasible solution, let alone an optimal one. The following example, borrowed from [15], is used for illustration:
A common problem found in all universities is making a timetable for the classes offered to students each semester. The first step in solving this problem is collecting the required data, which includes the courses offered and the students registered for them, the professors teaching these courses and their assistants, the tools required for instruction such as projectors, computers and special blackboards, the available laboratories, etc. The next step is to define the hard constraints, which are constraints that must all be met by a solution for it to be feasible. These constraints may include:
- Every class must be assigned to an available room that has enough seats for all students, all the tools required for students to carry out experiments if any, and all special instruction tools.
- Students enrolled in more than one course must not have their courses held at the same time.
- Professors must not have two classes with overlapping times.
- Classes must not start before 8 a.m. and must not end after 10 p.m.
- Classes for the Fall semester must not start before September and must not end after December.
These constraints are hard in the sense that violating them would severely hinder the education process. In addition to these hard constraints, however, there are soft constraints; a solution that violates any, some or even all of them would still be feasible, but satisfying them is highly encouraged. These constraints may include:
- It is preferable that undergraduate classes be held from 8 a.m. to 4 p.m., while postgraduate classes be held from 4 p.m. to 9 p.m.
- It is preferable that lectures be held before their corresponding exercise classes.
- It is preferable that students have more than 3 classes and fewer than 9 classes per day.
- It is desirable not to make students roam across opposite ends of the campus to take their classes.
- It is preferable not to assign more than 5 classes to each professor per day.
Although these constraints are all soft constraints, some of them are more important than others. For example, it is more important not to hold an exercise class before its corresponding lecture than it is to avoid making students go to opposite ends of the campus. To achieve this, each constraint is assigned a weight that reflects its importance and acts as a penalty in the fitness function.
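A weighted penalty of this kind can be sketched as follows; the constraint names and weight values are purely illustrative assumptions, not figures from the thesis:

```python
# Hypothetical soft-constraint weights; a larger weight marks a more
# important constraint and therefore a heavier penalty when violated.
weights = {
    "exercise_before_lecture": 5.0,  # exercise class held before its lecture
    "campus_roaming": 2.0,           # students cross campus between classes
    "professor_overload": 3.0,       # professor has more than 5 classes a day
}

def soft_penalty(violations):
    """Sum the weights of the violated soft constraints."""
    return sum(weights[name] for name in violations)

print(soft_penalty(["campus_roaming", "professor_overload"]))  # 5.0
```

A fitness function for the timetable would then subtract this penalty from (or add it to, for minimization) the solution's base score.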
After all the constraints have been defined, the problem of finding a solution that meets all the hard constraints and optimizes the soft constraints becomes a challenging task indeed.
2.3.4 Change Over Time
Almost all real-world problems are dynamic problems; they change over time. The only thing fixed about them is the fact that they are in continuous change. A man neglecting this fact is like a stock exchange investor who assumes that share prices will remain fixed, or a man who assumes that the weather will always be perfect for a picnic. To illustrate the significance of change over time, the stock investor example will be extended further.
Kevin is a stock exchange investor who must realize that the stock market is extremely dynamic. Share prices change every minute, affected by many factors.
Some factors are hard to predict. For example, the share price of a company that is announcing its profits/losses is expected to rise or fall. To simplify the situation, suppose one of two things must happen: the company announces good profits and its share price rises from 60$ to reach 110$, or it announces losses and its share price falls from 60$ to 10$. Assuming that Kevin knows about these two possibilities, he is left with a hard decision: either to sell his shares of this company or to buy more. Trying to rely on statistics, he may average the two possibilities, (110$ + 10$)/2 = 60$, which means that the share price will remain fixed, and this is definitely not going to happen. However, some other factors are biased and are predictable to some extent. A company achieving high profits for the past ten years is expected to announce profits for the current year and, consequently, its share price will increase.
Some events which affect the stock market are associated with particular periods of time. During Christmas and New Year's holidays share prices almost always fall, because investors need cash to celebrate. A few days later, share prices rise again.
On the other hand, some events are not predictable at all. A terrorist attack on an oil field in Saudi Arabia or a fire that breaks out in an oil refinery in Nigeria will cause oil prices to soar, while the discovery of a huge oil reserve in Venezuela will cause prices to fall.
Although the above factors may act against Kevin sometimes, they are not conspiring against him, but other investors are. Kevin faces many investors who act against him, because every penny he makes is subtracted from their profits. He has to be aware that the decisions he makes are reciprocated by other investors' decisions that ruin his gains and make his life harder. He, in return, has to react to their actions and wait for their responses, and so on.
2.3.5 Diversity of Solutions
A source of difficulty which is unique to MOPs is the diversity of solutions. In addition to the above difficulties, the problem solver has to present solutions that cover the entire PF; moreover, these solutions should be uniformly distributed over that front to give the DM flexibility in decision making. For example, the head of the design department in a mobile phone manufacturing factory has asked four design engineers to each present 21 phone speaker designs that adhere to speaker manufacturing standards but have a varying trade-off between speaker size and cost. These four groups of 21 designs each are to be handed to the marketing department, which will choose a single design to be manufactured.
The first engineer presented the designs shown in Figure 2.7(a). These designs are biased towards one end of the PF. As a consequence, the DM has plenty of designs with high cost and small size, but few designs with low cost and relatively bigger size. As the speaker size increases, the designs get separated by an increasing incremental step. These designs are not uniformly distributed across the PF, which is represented by the grey line in Figure 2.7.
The situation with the second engineer was different. His designs, presented in Figure 2.7(b), were biased towards both ends of the PF. Most of the designs are for big cheap speakers or small expensive ones; the DM is left with few options in between. These designs also are not uniformly distributed across the PF.
The third design engineer provided a different alternative. Although the designs he made are uniformly distributed, as shown in Figure 2.7(c), they only cover the medium range of the PF. The DM does not have the option to choose a highly expensive, tiny speaker or an overly big, cheap one. These designs do not cover the entire PF.
The fourth engineer presented the best solution. His designs, as shown in Figure 2.7(d), cover the entire PF. Moreover, they are uniformly distributed across that front. The DM in this case can choose from a sample of designs which fairly represents all possible designs adhering to speaker manufacturing standards, with a varying trade-off between speaker size and cost.
3.1 Introduction
During The Voyage of the Beagle (1831–1836), Charles Darwin (1809–1882) noticed that most species produce more offspring than can grow to adulthood, yet the population size of each species remains roughly stable. It is the struggle for survival that stabilizes the population size, given the limited, however stable, food resources. He also noticed that among sexually reproducing species no two individuals are identical, although many of the characteristics they bear are inherited. It was the variations among different individuals that directly or indirectly distinguished them from other species and from their peers, and rendered some of them more suited to their environment than others. Those who are fitter with respect to their environment are more likely to survive and reproduce than those who are less fit, which means that the fitter individuals have a greater influence on the characteristics of the following generations than the less fit individuals. This process results in offspring that evolve and adapt to their environment over time, which ultimately leads to new species [35]. Darwin documented his observations and hypotheses in his renowned yet controversial book "On the Origin of Species" [35], which constituted the Darwinian principles.
Although the Darwinian principles brought a lot of contention due to their apparent conflict with some religious teachings, to the point that some clergymen considered them blasphemy, these principles captured the interest of many naturalists and scientists. It wasn't until the 1950s that computational power emerged and allowed researchers and scientists to build serious simulation models based on Darwinian principles [36]. During the 1960s and 1970s, the field of Evolutionary Computation (EC) started to gain ground through the work of Ingo Rechenberg [37], John Bagley [38], John Holland [39] and Kenneth De Jong [40]. These researchers developed their algorithms and worked separately for almost 15 years, until the early 1990s, when the term CI and its subfields were formalized by the IEEE Neural Network Council and the IEEE World Congress on Computational Intelligence in Orlando, Florida in the summer of 1994 [41]. Since then, these various algorithms have been perceived as different representations of a single idea. Currently, EC along with SI, Artificial Neural Networks (ANNs) and Fuzzy systems, among other emerging intelligent agents and technologies, are considered subfields of CI [42].
Researchers have utilized different EC models in analyzing, designing and developing intelligent systems to solve and model problems in various fields, ranging from engineering [41, 43] and industry [44] to medicine [45, 46] and economics [47], among many others [48]. For a comprehensive record of EC history, the reader may refer to [49, 48, 50].
3.2 How and Why Do They Work?
EAs are a subfield of EC. They are biologically and nature-inspired algorithms that work by evolving a population of potential solutions to a problem, analogous to a population of living organisms. In the real world, the individuals of a population vary in their fitness to their environment: some of them can defend themselves against attacks, while others cannot survive an attack and perish; some can provide food for themselves and their offspring all the time, while others may not survive a short famine and starve to death. In EAs, the individuals of a population likewise are not equally fit. Due to differences in their characteristics some of them are fitter than others, and because the resources that keep them alive are limited, the fitter an individual is, the more likely it will survive and take part in new generations. Those who are fit enough to survive have the chance to mate with other fit individuals and produce offspring that carry slightly modified versions of their parents' good characteristics. But those fertile parents produce more offspring than their environment's resources can support, so only the best of the good individuals in the population survive to the next generation and start a new cycle of mating, reproducing and fighting for survival. This cycle is shown in Figure 3.1, where terminate is the termination condition; it could be a certain number of generations, an error tolerance value, or any other condition. As generations pass, the average fitness of the individuals increases because they adapt to their environment, which means that this population of evolving solutions will explore the solution space looking for the best value(s) and will progressively approach it (them).
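The cycle described above can be sketched as a generic loop. Everything here (truncation selection, the operator signatures, the toy problem) is an illustrative choice, not the only possibility:

```python
import random

def evolve(fitness, new_individual, crossover, mutate,
           pop_size=20, generations=100):
    """Generic evolutionary cycle: evaluate, select, mate, mutate, replace."""
    population = [new_individual() for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents.
        parents = sorted(population, key=fitness, reverse=True)[:pop_size // 2]
        # Reproduction: mate random pairs of parents and mutate the offspring.
        population = [mutate(crossover(*random.sample(parents, 2)))
                      for _ in range(pop_size)]
    return max(population, key=fitness)

# Toy run: maximize -(x - 3)^2, so the best individual should approach x = 3.
random.seed(1)
best = evolve(fitness=lambda x: -(x - 3) ** 2,
              new_individual=lambda: random.uniform(-10, 10),
              crossover=lambda a, b: (a + b) / 2,
              mutate=lambda x: x + random.gauss(0, 0.1))
print(round(best, 1))
```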
The idea of using Darwinian selection to evolve a population of potential solutions has another great benefit in solving dynamic problems. Because most real-life problems are dynamic, the definition of fitness and the rules of the problem may change after the problem has been formalized, so it would be useless to keep solving the problem using the old rules; that would mean solving a problem that no longer exists. By using a population of evolving solutions, however, the individuals adapt to the new rules of their environment over time.
EAs are a collection of algorithms that share the theme of evolution. The mainstream instances of EAs comprise Genetic Algorithms (GAs) [39], Evolutionary Strategies (ES) [37], Evolutionary Programming (EP) [51]. In addition to those three methods, Genetic Programming (GP) [52, 53], Learning Classifier Systems (LCS) [54, 55] and hybridizations of Evolutionary Algorithms with other techniques are classified under the umbrella of EAs as well [56].
3.3 Genetic Algorithms
Genetic Algorithms are the most popular method among all EAs. The GA was proposed by John Holland in 1962 [57], in the wake of the quest for the general problem solver [58, 59, 60]. The GA is a biologically-inspired algorithm. It evolves a population of individuals encoded in bit-strings (some GA variations use real-number representations). By analogy to biology, the encoded bit-string structure of an individual in a population is known as a chromosome. Figure 3.2 shows an example of chromosomes consisting of 8 bits each.
A GA solver starts its procedure by creating a population of individuals representing solutions to the problem. The population size (the number of individuals in the population) is among the parameters of the algorithm itself. A big population helps explore the solution space but is more computationally expensive than a smaller population, which in turn may not have the exploration power of the bigger one. After the population is created, the fitness of its individuals is evaluated using the fitness function. The fitness function is a property of the problem being optimized, not of the algorithm. It reflects how good a potential solution is and how close it is to the global optimum or the PF. So, good knowledge of the problem is required to create a fitness function that describes the problem being optimized as accurately as possible. After evaluating their fitness, some individuals are chosen to reproduce and are copied to the mating pool. The number of individuals to be chosen (the size of the mating pool) is a parameter of the algorithm. The different mechanisms for selecting which individuals reproduce are explained in Subsection 3.3.2, but for now it is enough to say that most of these mechanisms favor individuals with higher fitness values over those with lower fitness values. The next step is to match those individuals for reproduction, which is done at random in most cases.
Two variation operators are applied to the matched individuals (parents) to produce their offspring. The first variation operator is the crossover between the parents' chromosomes. As shown in Figure 3.2, parent1 and parent2 are two matched individuals. The chromosomes of these two parents are cut at the crossover point (between bit #4 and bit #5), and the resulting half chromosomes are swapped to create two offspring, offspring1 and offspring2. It is to be noted, however, that the crossover operator may not be applied to all parents in the mating pool. The crossover ratio defines the percentage of parents in the mating pool which will be affected by the crossover operator. This value is algorithm dependent, but it varies around 0.9 for most GA implementations [61]. After the crossover is done, the offspring chromosomes are mutated. Mutation of chromosomes in binary representation is done by flipping one or more bits in a chromosome. As shown in Figure 3.2, offspring1 was mutated by flipping one of its bits. As with the crossover operator, the mutation operator may not affect all individuals. The ratio of mutated bits to the total number of bits is known as the mutation ratio and is typically below one percent [61]. As the mutation ratio increases, the algorithm behaves more like a random search.
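The two operators just described can be sketched in Python as follows; the 8-bit strings and the cut point are illustrative, matching the style of Figure 3.2.

```python
import random

# Sketch of the two binary variation operators: one-point crossover at a
# given cut point, and independent bit-flip mutation with ratio p_m.

def one_point_crossover(parent1, parent2, point):
    # Swap the tails after the crossover point to create two offspring.
    off1 = parent1[:point] + parent2[point:]
    off2 = parent2[:point] + parent1[point:]
    return off1, off2

def mutate(chromosome, p_m, rng):
    # Flip each bit independently with probability p_m.
    return ''.join(b if rng.random() >= p_m else ('1' if b == '0' else '0')
                   for b in chromosome)

rng = random.Random(0)
o1, o2 = one_point_crossover('11110000', '00001111', 4)
m = mutate(o1, 0.01, rng)
```

Cutting both parents after bit #4 and swapping the halves yields `'11111111'` and `'00000000'`, exactly the swap illustrated in the text.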
The offspring join the population after their fitness is evaluated, to fight for survival. This stage is crucial for all individuals; based on their fitness, some of them will survive to the next generation, while others will perish. Different mechanisms can be used to select the surviving individuals, many of which are probabilistic techniques that favor fitter individuals.
The surviving individuals make up a new generation and restart the cycle of fitness evaluation, mating selection, reproduction, and survival selection. This cycle is repeated until a stopping condition is met, which could be the number of cycles, the solution error of the best individual, or any other condition or mix of conditions set by the algorithm designer.
GAs are flexible problem solvers; they are less dependent on the problem being solved than traditional techniques. Moreover, they provide a high degree of flexibility to their designer. A GA designer can choose a suitable representation scheme, a mating selection technique, and the variation operators of his choice.
3.3.1 Representations
How can a GA be less dependent on the problem being optimized than a traditional technique? Because the problem is transformed into the algorithm domain before the GA starts solving it. Some people even argue that a GA is problem independent because this transformation is done before the algorithm is invoked. The transformation into the algorithm domain is done mainly by providing an encoding mechanism for possible solutions to the problem, known as the representation, and by creating a fitness function.
The choice of which representation to use should be done within the context of the problem. A good representation for a scheduling problem may not be suitable for the Rastrigin problem given earlier in Section 2.1 and vice versa.
Binary Representations
The oldest and most used scheme of representation is the binary representation. As shown in Figure 3.3, a chromosome consists of a string of bits, each one of those bits is known as an allele. A gene is a combination of one or more alleles which determine a characteristic of the individual. For example, if the chromosome shown in Figure 3.3 is for an imaginary creature, then the three circled alleles may represent its eye color gene, and by varying the values of those three alleles its eye color will change.
To use a binary representation for a combinatorial problem, each possible solution to the problem can be represented by a unique sequence of digits. For example, the solutions of the SAT problem given in Subsection 2.3.1 can be directly transformed to bit-strings, so a 10-bit chromosome will be one of the 2^10 = 1024 different solutions to the 10-variable SAT problem.
However, using a binary representation to encode the real-valued solutions of the Rastrigin problem works differently. First, the search space (the solution space) is continuous, which means there is an infinite number of possible solutions, so it must be sampled. A very small sampling step leads to a huge number of possible solutions, which represents the solution space faithfully but is computationally expensive to search, while a big sampling step is less expensive but at the expense of a poor representation of the solution space. If the precision used to sample the Rastrigin problem over [−5.12, 5.12] is four digits after the decimal point, there are 10.24 × 10^4 = 102400 possible solutions per variable, which requires 17 bits to represent, since 2^16 < 102400 ≤ 2^17. So, x_min = −5.12 and x_max = 5.12. A transformation of the binary string back to the decimal form is done by converting the binary number to a decimal value d and then using the following rule:
x = x_min + d · (x_max − x_min) / (2^17 − 1) (3.1)
So, for example, the chromosome 1 0000 0000 0000 0000 is transformed to the decimal number d = 65536, and then by using (3.1)
x = −5.12 + (65536 / 131071) × 10.24 ≈ 0.00004 (3.2)
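The encoding and decoding above can be sketched in Python; the bounds and the 17-bit length follow the Rastrigin example, while the function names are illustrative.

```python
# Sketch of decoding a 17-bit chromosome into the interval [-5.12, 5.12]
# using the rule in (3.1), plus the inverse (quantised) encoding.

X_MIN, X_MAX, N_BITS = -5.12, 5.12, 17

def decode(bits):
    # Binary string -> integer -> real value in [X_MIN, X_MAX].
    d = int(bits, 2)
    return X_MIN + d * (X_MAX - X_MIN) / (2 ** N_BITS - 1)

def encode(x):
    # Inverse mapping, quantised to the 17-bit grid.
    d = round((x - X_MIN) * (2 ** N_BITS - 1) / (X_MAX - X_MIN))
    return format(d, '017b')

lo = decode('0' * 17)   # all-zeros chromosome -> X_MIN
hi = decode('1' * 17)   # all-ones chromosome  -> X_MAX
```

The quantisation step here is 10.24/(2^17 − 1) ≈ 7.8 × 10^−5, so a round trip through `encode` and `decode` reproduces a value to roughly the four-decimal precision assumed in the text.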
Realvalued Representations
It could be more suitable for problems with real-valued solutions to be represented using a real-valued representation. In this representation, each individual is a real-valued number (or vector of numbers) that expresses the value of the solution. Using this representation, there is no need to define a solution precision, such as the one defined for the binary representation, because the search or variation operators would be able to transform this solution into any of the infinite number of possible solutions in the solution space (if the floating-point precision of the computer used to solve the problem were unlimited). However, this type of representation requires other mechanisms of crossover and mutation, as explained in Subsection 3.3.3.
Integer Representations
For permutation problems, it might be more intuitive to represent solutions using integer values. This is illustrated using the following Traveling Salesman Problem (TSP) example.
The TSP is a combinatorial optimization problem that has been extensively studied over the last 150 years [62] due to its numerous applications, such as in transportation, networking, and printed circuit manufacturing. The problem is defined as [62]: given a number of cities and their intermediate distances, find a shortest route traversing each city exactly once.
As shown in Figure 3.5, a possible route the traveling salesman can take can be directly mapped to, and represented by, a string of integers, where the integers represent the cities in the order they are to be visited.
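The integer representation and its evaluation can be sketched in Python; the city coordinates below are illustrative, and the total tour length serves as the quantity to be minimized.

```python
import math

# Sketch of an integer (permutation) representation for the TSP: a route is
# a list of city indices, and its cost is the total tour length.

cities = {1: (0, 0), 2: (3, 0), 3: (3, 4), 4: (0, 4)}   # illustrative

def tour_length(route):
    # Sum the distances between consecutive cities, returning to the start.
    total = 0.0
    for a, b in zip(route, route[1:] + route[:1]):
        (x1, y1), (x2, y2) = cities[a], cities[b]
        total += math.hypot(x2 - x1, y2 - y1)
    return total

length = tour_length([1, 2, 3, 4])   # visit cities in numeric order
```

Variation operators for this representation must preserve the permutation property: every city appears exactly once in every offspring route.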
ProblemSpecific Representations
Using only one of the previously mentioned representations to represent all the variables of a problem may not be the best option for some problems, which is the case in the following example. The inverted pendulum problem shown in Figure 3.5 is a classic problem in control engineering. The objective is to construct a function that maps the system states to a horizontal force applied on the cart carrying the inverted pendulum to stabilize it vertically. The states of this problem are the position of the cart x, its speed v, the angular position of the pendulum θ, and its angular velocity ω. For such a problem, the states will be the inputs to the function and the output will be the applied horizontal force. This problem can be represented using a parse tree of symbolic expressions [52, 63, 15] (although parse tree representations are mainly used for GP, this representation was originally proposed for GAs [52]). The parse tree consists of operators, such as (+, −, *, /), and arguments, such as (θ, ω, and numeric constants). A possible tree may look like:
(+ (* 2.0 θ) (* 0.5 ω)) (3.3)
which is equivalent to:
F = 2.0 θ + 0.5 ω (3.4)
More on Representations
The schemes of representation mentioned above are by no means exhaustive. An algorithm designer may come up with his own representation, which may better suit his problem. However, the representation scheme, as mentioned before, should never be considered independent of the problem; if the variation operators used do not suit the representation, they may produce offspring that are totally different from their parents, which acts against the learning process because the algorithm then samples new trials from the space of possible solutions without regard for previous samples. This procedure may perform worse than random search [15].
Some representations incorporate the constraints of the problem and provide feasible solutions throughout the run. Although it may be desirable to evolve only feasible solutions and not to worry about the constraints or define them explicitly, it is useful to allow the algorithm to evolve a few infeasible solutions, to explore the solution space for possible scattered feasible regions that may contain better values.
3.3.2 Mating Selection
After a decision is made about how solutions will be represented, the next step is to decide on the mating selection procedure. Mating selection focuses on the exploration of promising regions in the search space [64]. It tends to select good individuals to mate, hoping that their offspring will be as good as or better than their parents, since they inherit many of their parents' characteristics. However, if the selection procedure strictly selects the very few top individuals of the population, the population will lose its diversity, because after a few generations it will only contain slightly different copies of the few individuals that were the best in their generations; if this procedure continues, the population will be made up of almost identical individuals after a few more generations. The main idea behind a GA is to evolve a population of competing individuals, so if the individuals become identical, there will be no competition and hence no evolution. So, keeping a diverse population indeed helps the algorithm explore the search space [63]. The degree to which the algorithm favors and selects the best individuals in the population for mating is measured by the selection pressure, defined as “the ratio of the probability of selecting the best individual in the population to that of an average individual” [64]. As the selection pressure increases, the algorithm tends to choose only the very best of the population to mate and produce the next generation, leading to a population that converges to a local optimum; this situation is known as premature convergence. On the other hand, if the selection pressure decreases, the algorithm will converge slowly and wander in the search space. It should be clear that the selection pressure is not a parameter whose value the algorithm designer explicitly sets; instead, it is influenced by different aspects of the algorithm, especially the selection mechanism used. It is to be noted that the following selection mechanisms define the preference of selecting an individual in the population for mating; the number of offspring this individual produces in each mating process is another issue.

Fitness Proportionate Selection
Fitness Proportionate Selection (FPS), sometimes known as roulette wheel selection, is one of the earliest selection mechanisms, proposed by Holland [39]. It selects individuals based on their absolute fitness. That is, if the fitness of individual i in the population is f_i, the probability of selecting this individual is p_i = f_i / Σ_{j=1}^{n} f_j, where n is the population size. This selection mechanism allows the most fit individuals in the population to multiply very quickly early in the run and take over the population, and after a few more generations the diversity almost vanishes and the selection pressure becomes very low, leading to a stagnant population — in other words, to premature convergence. Another handicap of this mechanism is that it behaves differently on transposed versions of the same fitness function [65]. For example, given a population of two individuals a and b with fitness values 1 and 2, respectively, the probability of choosing the first individual is 1/3 while that of the second is 2/3, a 1:2 ratio. However, if a constant value of 10 is added to the fitness values of the population, the probabilities become 11/23 and 12/23, which is almost a 1:1 ratio.
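The mechanism, including the two-individual example above, can be sketched in Python (the function names are illustrative):

```python
import random

# Sketch of fitness proportionate (roulette wheel) selection: individual i
# is selected with probability f_i / sum_j f_j.

def fps_probabilities(fitnesses):
    total = sum(fitnesses)
    return [f / total for f in fitnesses]

def roulette_select(population, fitnesses, rng):
    # Spin the wheel: walk the cumulative fitness until r is reached.
    r = rng.random() * sum(fitnesses)
    acc = 0.0
    for individual, f in zip(population, fitnesses):
        acc += f
        if r <= acc:
            return individual
    return population[-1]

p = fps_probabilities([1.0, 2.0])              # 1/3 and 2/3
p_shifted = fps_probabilities([11.0, 12.0])    # nearly 1/2 each
winner = roulette_select(['a', 'b'], [1.0, 2.0], random.Random(0))
```

The flattening of `p_shifted` towards equal probabilities is exactly the transposition sensitivity criticized in the text.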
A possible remedy to the weak selection pressure and the inconsistent behavior of the algorithm along the run is to subtract from each fitness value the fitness of the worst individual found in a window containing the last few generations. This approach can be considered a dynamic scaling of the population fitness.
Another possible solution is Goldberg's sigma scaling method [54], which scales the fitness of individuals using the mean f̄ and standard deviation σ of the fitness in the population:
f′(i) = max( f(i) − (f̄ − c · σ), 0 ) (3.5)
where c is a constant, typically set to 2.
Rank Selection
Another way to overcome the deficiencies of FPS is to use rank selection. As its name implies, this procedure orders all individuals in the population according to their fitness, and then selects individuals based on their rank rather than their absolute fitness as in FPS. This method maintains a constant selection pressure, because no matter how big the gap between the most and least fit individuals in a population is, the probability of selecting each of them remains the same as long as the population size remains fixed.
After the individuals are ordered in the population, they are assigned another fitness value inversely related to their rank. The two most used fitness assignment functions result in linear ranking and exponential ranking.
For linear ranking, the best individual is assigned a fitness of s (where 1 < s ≤ 2), while the worst one is assigned a value of 2 − s. The fitness values of the intermediate individuals are determined by interpolation. This can be achieved using the following function:
f(i) = s − 2(s − 1) · (i − 1) / (N − 1) (3.6)
where i is the individual rank (with i = 1 the best) and N is the number of individuals. For linear ranking, the selection pressure is proportional to s.
For nonlinear ranking, the best individual is assigned a fitness value of 1, the second is assigned s, typically around 0.99 [66], the third s², and so on down to s^{N−1} for the lowest-ranked individual. The selection pressure of nonlinear ranking increases as s decreases.
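Both ranking schemes can be sketched in Python; the parameter values s = 1.5 and s = 0.99 are illustrative.

```python
# Sketch of rank-based fitness assignment. Linear ranking interpolates
# from s (best) down to 2 - s (worst), per (3.6); exponential ranking
# assigns 1, s, s^2, ... down the ranking.

def linear_rank_fitness(n, s=1.5):
    # Index 0 = best individual, index n-1 = worst.
    return [s - 2 * (s - 1) * i / (n - 1) for i in range(n)]

def exponential_rank_fitness(n, s=0.99):
    # Best gets 1, second s, third s^2, ...
    return [s ** i for i in range(n)]

lin = linear_rank_fitness(5)         # 1.5, 1.25, 1.0, 0.75, 0.5
exp_ = exponential_rank_fitness(3)   # 1.0, 0.99, 0.9801
```

Note that the assigned values depend only on rank, so the selection pressure stays constant no matter how skewed the raw fitness values are.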
Tournament Selection
Unlike the previously mentioned selection procedures, tournament selection does not require knowledge of the fitness of the entire population, which can be time consuming in applications with huge populations that require fast execution. It is also suitable for situations with no universal fitness definition, such as comparing two game-playing strategies; it might be hard to set a fitness function that evaluates the absolute fitness of each of those two strategies, but it is possible to simulate a game played by the two strategies as opponents, and the winner is considered the fitter.
Tournament selection picks k individuals at random from the population and selects the best of them for mating. This requires neither a full ordering of the sample nor absolute knowledge of their fitness. The selection pressure of this mechanism can be varied by changing the sample size k; it increases with k and reaches its maximum at k = n, the population size. This selection mechanism can be used with or without replacement. Moreover, a nondeterministic variant can be used, in which the probability of selecting the best individual in the sample is less than one, to give the worst individual in the population a chance of being selected; otherwise, it would never be selected.
This selection mechanism is the most widely used because it is simple, does not require knowledge of the global fitness, adds little computational overhead, and its selection pressure can be easily changed by varying the sample size k. A common sample size is k = 2 [67].
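Tournament selection can be sketched in Python; the population of integers and the identity fitness function are illustrative.

```python
import random

# Sketch of tournament selection with sample size k: draw k individuals at
# random (without replacement here) and return the best of the sample.

def tournament_select(population, fitness, k, rng):
    sample = rng.sample(population, k)   # pick k individuals at random
    return max(sample, key=fitness)      # the sample winner mates

rng = random.Random(42)
population = list(range(10))             # individuals 0..9, fitness = value

winner = tournament_select(population, lambda x: x, k=10, rng=rng)   # k = n
w2 = tournament_select(population, lambda x: x, k=2, rng=rng)        # binary
```

With k equal to the population size the tournament degenerates into picking the global best, illustrating the maximum selection pressure mentioned above.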
3.3.3 Variation Operators
In the real world, although parents and their offspring have common characteristics, they are never identical; like father like son, but the son is not a clone of the father. This variation among individuals was noted by Charles Darwin in his controversial book [35], where he emphasized that this variation is a major force driving the evolution process. To mimic this process in GAs, researchers have developed variation operators that help the algorithm search the solution space; hence, they are sometimes known as search operators. These variation operators have two goals: the first is to produce offspring that resemble their parents, while the second is to slightly perturb their characteristics. The oldest and most widely used variation operators are the crossover and mutation operators [39]. They were proposed by John Holland to operate on binary GAs; however, many other variation operators have been proposed for other forms of GA representations [69]. It is to be noted that all syntactic manipulations by variation operators must yield semantically valid results [70].
Crossover
The fact that offspring very often look much like their parents intrigued Gregor Mendel (1822–1884) and led him to carry out his famous experiments on pea plants. By analyzing the outcomes of his experiments on some 28,000 samples, he reached a conclusion about the rules of inheritance and published these findings in a paper [71] which is considered the basis of modern-day genetics. Researchers in GAs were inspired by these rules and used a modified version of them in their algorithms [39], and from there came the GA crossover operator.
A recombination operator is known as an exploitation operator because it exploits the knowledge accumulated in the current solution vectors by using parts of them as building blocks of their offspring. It is widely believed in the AI community that exploitation should be emphasized at later stages of the search to prevent premature convergence [72, 73].
Different versions of the crossover operator are applied to binary, real-valued, and parse tree representations, but they essentially do the same job: they create the genes of an offspring by copying a combination of its parents' genes.
n-point crossover [40] is used mainly for binary representations. The operator randomly picks n points along a copy of the two parents' chromosomes, divides each of them into n + 1 parts, and then swaps some of these parts to create the offspring chromosomes. Figure 3.6 shows an example of a 3-point crossover operation. First, the three crossover points are randomly assigned the positions shown in the figure, dividing each parent into 4 parts. Then, offspring1 is created by copying the first and third parts of parent1 and the second and fourth parts of parent2, while offspring2 is created by copying the first and third parts of parent2 and the second and fourth parts of parent1. The probability of independently crossing over each individual in the mating pool is known as the crossover ratio p_c.
Uniform crossover is another alternative for binary GAs. The number of crossover points in this operator is not fixed; instead, each allele of the offspring is created by copying one of the two corresponding alleles of its parents with a certain probability. By increasing the probability of copying from the first parent p1, the offspring alleles will look more like those of p1. The procedure is repeated for each offspring.
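A minimal Python sketch of this procedure follows; the names p1, p2, and the copying probability prob are assumptions mirroring the description.

```python
import random

# Sketch of uniform crossover: each offspring allele is copied from p1 with
# probability prob, otherwise from p2 (the bit-strings are illustrative).

def uniform_crossover(p1, p2, prob, rng):
    return ''.join(a if rng.random() < prob else b for a, b in zip(p1, p2))

rng = random.Random(3)
child = uniform_crossover('11111111', '00000000', prob=0.5, rng=rng)
all_p1 = uniform_crossover('11111111', '00000000', prob=1.0, rng=rng)
```

With prob = 1.0 the offspring is an exact copy of p1, showing how the bias parameter controls resemblance to the first parent.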
A real-valued representation can be transformed into a binary representation to apply a binary crossover operator as shown previously, and then transformed back into the real-valued representation. However, this procedure is not recommended, as it adds computational cost and sacrifices precision due to the sampling and rounding-off errors encountered in the decimal–binary and binary–decimal conversions. Instead, it is recommended to apply one of the recombination operators specially made for real-valued representations.
The blending methods create the variables of the offspring by weighting the corresponding variables in their parents' chromosomes. For example, to create the i-th variables of the offspring chromosomes, the following rule can be used:
o1_i = β · p1_i + (1 − β) · p2_i (3.7)
o2_i = (1 − β) · p1_i + β · p2_i (3.8)
where o1_i is the i-th variable of the first offspring, p1_i and p2_i are the i-th variables of the two parents, and β is the weighting factor in the range [0, 1]. The higher the value of β, the more the first offspring will look like the first parent p1, and the lower β gets, the more it will resemble the second parent p2.
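The blending rule can be sketched in Python; the parent values and β are illustrative.

```python
# Sketch of blending recombination per (3.7)-(3.8): each offspring variable
# is a convex combination of the corresponding parent variables.

def blend(parent1, parent2, beta):
    off1 = [beta * a + (1 - beta) * b for a, b in zip(parent1, parent2)]
    off2 = [(1 - beta) * a + beta * b for a, b in zip(parent1, parent2)]
    return off1, off2

o1, o2 = blend([1.0, 2.0], [3.0, 6.0], beta=0.75)
```

With β = 0.75 the first offspring lies three quarters of the way towards the first parent, and the second offspring mirrors it.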
The Simulated Binary Crossover (SBX) operator is a recombination alternative to consider for real-valued representations. This operator was proposed by Deb [69], who claims that it provides a self-adaptive search mechanism [74]. The operator is explained by the following rules:
o1_i = 0.5 · [(1 + β) · p1_i + (1 − β) · p2_i] (3.9)
o2_i = 0.5 · [(1 − β) · p1_i + (1 + β) · p2_i] (3.10)
and β is evaluated using the following rule:
β = (2u)^{1/(η+1)} if u ≤ 0.5; β = [1 / (2(1 − u))]^{1/(η+1)} otherwise (3.11)
where u is a uniformly distributed random number in the range [0, 1), and η is a distribution index; a low value of η (less than 2) gives a high probability of producing offspring different from their parents, while a high value (greater than 5) means that the offspring will be very close to their parents in the solution space.
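The SBX rules can be sketched in Python; the parent values and the distribution index η = 2 are illustrative.

```python
import random

# Sketch of SBX per (3.9)-(3.11): draw the spread factor beta from the
# polynomial distribution controlled by eta, then blend the two parents
# symmetrically around their midpoint.

def sbx_beta(u, eta):
    # Equation (3.11): u <= 0.5 contracts, u > 0.5 expands.
    if u <= 0.5:
        return (2 * u) ** (1 / (eta + 1))
    return (1 / (2 * (1 - u))) ** (1 / (eta + 1))

def sbx(x1, x2, eta, rng):
    beta = sbx_beta(rng.random(), eta)
    o1 = 0.5 * ((1 + beta) * x1 + (1 - beta) * x2)
    o2 = 0.5 * ((1 - beta) * x1 + (1 + beta) * x2)
    return o1, o2

o1, o2 = sbx(1.0, 3.0, eta=2, rng=random.Random(0))
```

A useful property visible in (3.9)–(3.10) is that the offspring midpoint equals the parent midpoint for any β; at u = 0.5, β = 1 and the offspring coincide with their parents.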
For parse tree representations, a recombination operator swaps subtrees of the solution vectors. For example, given the following solution vector
(+ (* 2.0 θ) (* 0.5 ω)) (3.12)
by swapping the subtrees (* 2.0 θ) and (* 0.5 ω), it becomes
(+ (* 0.5 ω) (* 2.0 θ)) (3.13)
However, care must be taken to prevent trees from growing rapidly and reaching extreme lengths, because this may exhaust the resources of the machine executing the algorithm. This phenomenon is known as bloating, and it is a major concern in GP implementations.
Mutation
Unlike recombination operators, mutation operators do not make use of the knowledge of the search space acquired through generations; they perturb the population by injecting random genetic material, provided the result remains semantically valid.
For binary GAs, the mutation operator first determines the positions of the alleles that will undergo mutation. The choice of these alleles is made at random, with an equal probability for each of them to be mutated (a uniform distribution), and their number is determined by the mutation ratio p_m, which is the probability of independently inverting one allele. Second, the operator flips the selected alleles to produce the mutated offspring. For example, given a pool with two solution vectors of length L = 8, x1 = 00110101 and x2 = 01100110, and a mutation ratio p_m = 0.125, a mutation operator applied to them will mutate p_m × 2L = 2 alleles, and may turn the solution vectors into 01110100 and 01100110, respectively. Note that the two alleles to be mutated (the second and eighth alleles of x1) happened to fall in the first solution vector, while the second vector remained intact.
Mutating real-valued GAs poses some challenges. If the mutation operator does not operate on specific values of the parents [75], it will allow the offspring to escape local optima and will help the algorithm explore new regions of the search space, but it will break the link between the parents and their offspring instead of causing a small perturbation. On the other hand, if the mutation operator does operate on specific values of the parents to produce their offspring, it may not be very helpful in escaping local optima, but it will keep the link between the parents and their offspring. The latter approach is further explained here.
For an n-dimensional solution vector x, the mutation operator will be of the form [76]
x′ = m(x) (3.14)
where x is the parent solution vector, m(·) is the mutation operator, and x′ is the mutated offspring solution vector. The mutation operator may simply add a real random variable M to each variable of the parent vector:
x′_i = x_i + M (3.15)
where x_i is the i-th variable of the solution vector x. It is recommended to create M with a mean value of 0, to prevent a bias towards some parts of the search space and to keep the offspring uniformly distributed around their parents. If M has a uniform distribution in the range [−a, a], it will be equally probable that the mutated x′_i takes any value in the interval [x_i − a, x_i + a], so a fully mutated vector may lie anywhere in the corresponding hyperbox around its parent. Alternatively, M can have a normal (or Gaussian) distribution, which can be represented by the following equation:
p(M) = (1 / (σ√(2π))) · e^{−(M − μ)² / (2σ²)} (3.16)
where σ is the standard deviation and μ is the mean. If μ is set to 0, the value of σ will determine the probability of different mutation strengths; as σ increases, it becomes more probable that the offspring will lie away from its parent, while decreasing σ increases the probability of producing offspring that look similar to their parent.
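Gaussian mutation per (3.15)–(3.16) can be sketched in Python; σ = 0.1 and the parent vector are illustrative.

```python
import random

# Sketch of Gaussian mutation: perturb each variable by a normally
# distributed value with mean 0 and standard deviation sigma.

def gaussian_mutate(x, sigma, rng):
    return [xi + rng.gauss(0.0, sigma) for xi in x]

rng = random.Random(7)
child = gaussian_mutate([1.0, 2.0, 3.0], sigma=0.1, rng=rng)
```

A zero mean keeps the offspring centered on the parent, while σ tunes the mutation strength as described in the text.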
The Polynomial Mutation is another mutation operator for real-valued GAs [69, 77]. The offspring is mutated using the following rule
x′_i = x_i + (x_i^U − x_i^L) · δ̄ (3.17)
where x_i^U and x_i^L are the upper and lower bounds of the variable x_i, respectively, and δ̄ is calculated from a polynomial distribution by using
δ̄ = (2u)^{1/(η_m+1)} − 1 if u < 0.5; δ̄ = 1 − [2(1 − u)]^{1/(η_m+1)} otherwise (3.18)
where u is a uniformly distributed random number in the range [0, 1), and η_m is a mutation distribution index.
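Polynomial mutation can be sketched in Python; the bounds and η_m = 20 are illustrative.

```python
import random

# Sketch of polynomial mutation per (3.17)-(3.18): delta is drawn from a
# polynomial distribution with index eta_m and scaled by the bounds.

def poly_delta(u, eta_m):
    # Equation (3.18): u < 0.5 gives a negative delta, u > 0.5 a positive one.
    if u < 0.5:
        return (2 * u) ** (1 / (eta_m + 1)) - 1
    return 1 - (2 * (1 - u)) ** (1 / (eta_m + 1))

def poly_mutate(x, lower, upper, eta_m, rng):
    return x + (upper - lower) * poly_delta(rng.random(), eta_m)

y = poly_mutate(0.0, -1.0, 1.0, eta_m=20, rng=random.Random(5))
```

Since δ̄ always falls in [−1, 1], the perturbation never exceeds the full width of the variable's range, and a large η_m concentrates offspring near the parent.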
The final mutation operators to be presented here are the parse tree mutation operators. Among many operators, Angeline defines four [76], illustrated in the accompanying figure. The grow operator randomly selects a leaf from the tree and replaces it with a randomly generated subtree (a). The shrink operator randomly selects an internal node from the tree and replaces the subtree below it with a randomly generated leaf (b). The switch operator randomly selects an internal node from the tree and rearranges its subtrees (c). The cycle operator randomly selects a node from the tree and replaces it with another node, provided that the resulting tree is a valid one (d).
3.3.4 Schema Theorem
In his early work on GAs, John Holland presented the schema as a building block for individuals in a population. While explicitly evaluating the fitness of the individuals, a GA implicitly evaluates the fitness of many building blocks without any added computational overhead (implicit parallelism). As the search for the optimal solution progresses, the algorithm focuses on promising regions of the search space defined by schemata whose fitness is above the average of the population in their generations (above-average schemata); the fitter the individuals a schema produces, the more samples will be produced from this schema [63]. The following discussion on schemata assumes a binary GA representation.
A schema is a template for solution vectors. It defines fixed values for some positions and assumes don't care values for the other positions. For example, given the following schema of length L = 10
H = (* * * 1 0 1 0 1 1 *) (3.19)
there are six fixed positions (taking a value of '1' or '0') and four don't care positions (marked by '*'). The number of fixed positions in a schema is known as the order of the schema, so the order of the schema H is o(H) = 6. Another property of the schema is its defining length δ(H), which is the distance between the first and last fixed positions in the schema. The defining length of the schema H is δ(H) = 9 − 4 = 5.
Any schema may produce 2^d distinct individuals, where d is the number of don't care positions in this schema; the schema H above matches 2^4 = 16 solution vectors. On the other hand, any solution vector of length L can be matched by 2^L different schemata. It is clear that the schema (* * * * * * * * * *) is the most general schema; it matches every solution vector of length 10 (which includes the entire population). A schema with no don't care positions, such as (0 1 0 1 1 0 0 1 1 0), is among the most specific schemata; it matches a single solution vector, namely 0101100110.
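The schema properties just defined can be sketched in Python; the schema string and the test vectors are illustrative.

```python
# Sketch of schema properties for binary representations: order (number of
# fixed positions), defining length (distance between first and last fixed
# positions), and matching against a solution vector.

def order(schema):
    return sum(1 for c in schema if c != '*')

def defining_length(schema):
    fixed = [i for i, c in enumerate(schema) if c != '*']
    return fixed[-1] - fixed[0]

def matches(schema, vector):
    return all(s == '*' or s == v for s, v in zip(schema, vector))

H = '***101011*'   # illustrative schema of length 10
```

The functions confirm that this schema has order 6 and defining length 5, and that it matches any vector carrying the pattern 101011 in positions 4 to 9.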
Another property of a schema is its fitness at time t, f(H, t). The fitness of a schema is defined as the average fitness of all individuals in the population matched by this schema [78].
Given a population of size n, with solution vectors of length L and a fitness function f, the schema H which matches m individuals in the population at time t will have an average fitness of
f(H, t) = (1/m) · Σ_{j=1}^{m} f(x_j) (3.20)
where the sum runs over the m individuals matched by H.
If the number of individuals matched by schema H at time t is m(H, t) and their average fitness is f(H, t), then under fitness proportionate selection it is expected that for the next generation the number of individuals matched by H will be
m(H, t+1) = m(H, t) · n · f(H, t) / Σ_{j=1}^{n} f_j (3.21)
If the average fitness of the population is f̄(t) = (1/n) · Σ_{j=1}^{n} f_j, then the previous equation can be rewritten as
m(H, t+1) = m(H, t) · f(H, t) / f̄(t) (3.22)
From the last equation it is clear that if the schema is above-average for the current generation (f(H, t) > f̄(t)), then the number of individuals matched by this schema will increase in the next generation. But if the schema is below-average, the number of individuals this schema matches will decrease in the next generation.
If m(H, 0) is known, then m(H, t) can be directly evaluated from (3.22)
m(H, t) = m(H, 0) · [f(H) / f̄]^t (3.23)
assuming that H maintains a fixed above-average fitness through generations.
The reproductive schema growth equation (3.23) only considers selection and ignores the effect of crossover on the population. To understand how crossover may disrupt a schema, the following example is presented.
Suppose the two solution vectors x1 = 0001010110, which is counted among the instances of the schema H = (* * * 1 0 1 0 1 1 *), and x2 = 1110101001 exist in a population, and the crossover operator decides that those vectors will be crossed over right after the sixth allele. Neither of the two offspring produced (0001011001 and 1110100110) will be matched by the schema H. However, if the crossover operation took place right after the third allele, the offspring would look like 0000101001 and 1111010110, and there would be one solution vector (1111010110) matched by H. This means that equation (3.23) is not accurate.
It is clear that a higher defining length for a schema increases the probability of its destruction, because it is more likely that the crossover point will fall between the first and last fixed positions of the schema. Hence, given the crossover rate p_c, the probability of schema survival will be 1 − p_c · δ(H)/(L − 1). But the crossover operation may not destroy the schema even if it splits the vectors between two fixed positions: it is possible to cross over two solution vectors matched by the same schema, in which case, regardless of the crossover position, the two resulting offspring are still matched by that schema. So the probability of schema survival slightly increases, and the reproduction schema growth equation becomes
m(H, t+1) ≥ m(H, t) · [f(H, t) / f̄(t)] · [1 − p_c · δ(H)/(L − 1)] (3.24)
To make the above equation more accurate, the mutation operator should be considered as well.
The mutation operator will destroy a schema only if it inverts one of its fixed positions. So, the higher the order of a schema is, the more likely it will be destroyed by a mutation operator. Henceforth, given the schema order () and the mutation ratio (). The probability that a fixed position will survive mutation is (), and the probability of schema survival is , which for normal mutation ratios () can be approximated to . The reproduction schema growth equation will now look like:
m(H, t+1) ≥ m(H, t) · [f(H)/f̄] · [1 − p_c · δ(H)/(l − 1) − o(H) · p_m]   (3.25)
This equation shows how fast a schema may influence the population: the number of individuals created using this template increases exponentially over time if the schema is above-average and has a sufficiently short defining length and low order. It also shows that a schema with a relatively short defining length and low order will be sampled more often than another schema with a longer defining length and higher order.
Theorem 1 (Schema Theorem).
Short, low-order, above-average schemata receive an exponentially increasing number of trials in subsequent generations of a genetic algorithm.
For a better understanding of the schema theorem and the theory given above, the following example is given.
A GA is run with a fixed population size, individual length, and crossover and mutation ratios. At initialization, 5 individuals in the population were matched by a schema H.
Using the reproduction schema growth equation given in (3.25), two tables are created: Table 3.1 illustrates the effect of the schema/population fitness ratio f(H)/f̄ on the schema growth rate for fixed δ(H) and o(H), while Table 3.2 shows the effect of different defining length and order values (δ(H), o(H)) on the growth rate of H for a fixed fitness ratio.
The values shown in the body of the two tables are the expected number of individuals that schema H will match through the generations (rounded down), and a "–" entry means that the overwhelming number of schema instances in the population violated the assumed average population fitness.
Table 3.1: Effect of the schema/population fitness ratio on the growth of schema H.
f(H)/f̄   Generations:  1  10  20  30  40  50  60  70  80  90  100
1.8   5  4  2  2  1  1  1  0  0  0  0
1.9   5  6  7  8  9  11  13  15  18  21  24
2.0   5  9  18  35  69  134  263  –  –  –  –
2.1   5  14  45  144  –  –  –  –  –  –  –
2.2   5  22  110  –  –  –  –  –  –  –  –
Table 3.2: Effect of the defining length and order (δ(H), o(H)) on the growth of schema H.
(δ(H), o(H))   Generations:  1  10  20  30  40  50  60  70  80  90  100
(10, 6)   5  11  24  55  125  285  –  –  –  –  –
(10, 7)   5  9  17  33  63  120  229  –  –  –  –
(10, 8)   5  8  12  19  31  50  79  127  203  –  –
(11, 8)   5  4  3  3  2  2  1  1  1  1  1
(12, 8)   5  2  1  0  0  0  0  0  0  0  0
The figures presented in Table 3.1 show how increasing the schema fitness by 10% switches the schema from vanishing (at a ratio of 1.8) to exponentially increasing its samples (at a ratio of 1.9). By increasing the fitness ratio further, the schema takes over the population faster and faster, until, at a fitness ratio of 2.2, it takes over the population in less than 30 generations.
Table 3.2 shows how a slight modification of the defining length or the order of the schema can have a significant effect on its growth rate: a unit increase in the defining length (from 10 to 11) leads to extinction. A unit increase in the schema order also slows the growth rate, but not as dramatically as the defining length does.
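The growth predicted by equation (3.25) can be iterated directly. The following sketch uses illustrative parameter values, not the exact settings behind Tables 3.1 and 3.2, but it reproduces the same qualitative switch between vanishing and exponential takeover:

```python
import math

def schema_growth(m0, fitness_ratio, p_c, p_m, delta, order, length,
                  pop_size, generations):
    """Iterate the schema growth bound (3.25).

    m0            initial number of schema instances
    fitness_ratio f(H)/f-bar, assumed constant over the run
    delta, order  defining length and order of the schema
    length        individual (string) length
    Returns the expected instance count per generation, rounded down
    and capped at the population size.
    """
    survival = 1.0 - p_c * delta / (length - 1) - order * p_m
    counts, m = [m0], float(m0)
    for _ in range(generations):
        m *= fitness_ratio * survival
        counts.append(min(math.floor(m), pop_size))
    return counts
```

With these assumed parameters, a fitness ratio whose product with the survival probability is just below 1 makes the schema vanish, while a slightly higher ratio makes it grow exponentially, mirroring the switch between the 1.8 and 1.9 rows of Table 3.1.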
3.4 Polyploidy
Due to the good results obtained using the first GA representations [39, 40], researchers used them without significant modifications. Although many living organisms carry redundant chromosomes in their cells (polyploid organisms), the effect of polyploidy remains under-researched in Evolutionary Multiobjective Optimization (EMO). In nature, redundant chromosomes help organisms adapt to their environment. When a large asteroid or comet struck Earth 65 million years ago, some species adapted to their new environment and survived, while the dinosaurs failed to adapt and perished. Moreover, these redundant chromosomes help maintain a varied set of organisms to fill different niches in the environment these organisms live in. A heavy, muscled animal can defend itself against attacks but is slower than a lighter, less muscled animal that can catch its prey. A trade-off between those two traits (muscles and weight) provides animals with advantages over one another.
Polyploid species have redundant chromosomes in their cells. The alleles that result in an organism that fits its environment well tend to be expressed (dominant alleles) more than the other alleles (recessive alleles). Meanwhile, the other alleles are held in abeyance and rarely expressed, until the environment changes to favor one or some of them and makes them the new dominant alleles.
Though there are dominance schemes other than simple dominance, such as partial dominance and codominance, their effects have not been investigated before, mainly because most research on polyploidy was carried out on binary problems. In partial dominance, an intermediate value between the parents' alleles is expressed, providing even more population diversity than simple dominance, in which a distinct parent allele is expressed. A classical example of partial dominance is the color of the carnation flower, which takes shades of red according to the presence or absence of the red pigment allele. In the codominance scheme, both alleles are expressed. A well-known example of codominance is the Landsteiner blood types, in which both 'A' and 'B' blood type alleles are expressed, leading to an 'AB' blood type that carries both phenotypes.
3.4.1 Current Representations
Early work examining the effect of polyploidy in GA goes back to 1967 in Bagley's dissertation [38], where he examined the effect of a diploid representation. In his work, he used a variable dominance map encoded in the chromosome. A drawback of his model was the premature convergence of dominance values, which led to an arbitrary tie-breaking mechanism [79]. This work was followed by a triallelic dominance scheme used by Hollstien [80] and Holland [39]. For each allele, they added a dominance value that is associated and evolved with it, taking the values 0, recessive 1, or dominant 1, though they used different symbols.
Unlike the previously mentioned work, which was done on stationary environments, Goldberg and Smith [79] used a non-stationary environment: a 0-1 knapsack problem with two oscillating limiting weights. They concluded that the power of polyploidy lies in non-stationary problems, because the recessive alleles held in abeyance remember past experiences. However, they did not show the performance of their algorithm in remembering more than two oscillating objectives. This was a significant shortcoming, because most real-world non-stationary problems are non-cycling: changes may come out of order and sometimes never repeat. Ng and Wong [81] argued that the enhanced performance in [79] was due to the slow convergence encountered in the diploid representation. They proposed a dominance scheme that used dominant (0, 1) and recessive (0, 1) alleles and inverted the dominance of alleles whenever the individual's fitness fell below a 20% threshold value. They reported enhanced performance over the triallelic representation. Some researchers extended the application of polyploidy beyond GA; it was applied to Genetic Programming (GP) as well [82].
All the work previously mentioned was conducted on single-objective optimization problems. Most researchers used a relatively low number of decision variables and binary problems such as the 0-1 knapsack problem. At the same time, many of these investigations were concerned with manipulating the simple dominance scheme and comparing its variants. The monoploid number (number of chromosomes in each solution vector) remained constant in most of these investigations. Only scarce applications such as [83] addressed Multi-Objective Optimization Problems (MOOP). In [83], diploid vectors were used to search a 2-dimensional space optimizing 3 objectives for a food extrusion process. First, each objective was treated separately, producing offspring better than the worst individual for that objective (otherwise the offspring is killed); then Pareto dominance was applied to the combined populations of all objectives.
What is common among all the previous work is the use of simple dominance, where an allele either dominates or is recessive, which is not always the case in nature. Partial dominance and codominance have not been investigated before. Partial dominance produces new phenotypes that help the population adapt to totally new environments and remember them; it provides more phenotype diversity than simple dominance. Hence, a new model is provided here which differs from other polyploid models [39, 80, 79, 38, 81]. In this model: i) uniform crossover of each parent's chromosomes precedes recombination; ii) the offspring alleles are created from their parents' alleles using partial dominance; iii) the algorithm is applicable to non-binary problems.
3.4.2 Proposed Representation and Procedure
In biology, ploidy is the number of sets of chromosomes in a cell: a cell that has one set of chromosomes is a monoploid cell, one that has two sets is a diploid cell, three sets make it triploid, and so on. The term "d-ploid" is used here to indicate the ploidy number, so a 1-ploid representation is a monoploid one, a 2-ploid representation is a diploid one, and so on. In this algorithm, the partial dominance scheme is used. The phenotype of each solution vector is based on the partially dominant alleles, the Dominant-alleles-set (DAS), as shown in Figure 3.8. The DAS of any solution vector determines its fitness value. The ploidy number of a solution vector is the number of all chromosomes in that vector, including the DAS chromosome. A detailed procedure of the algorithm is given as follows:

Initialization: For a d-ploid representation, a population is initialized at random by filling the DAS and the remaining chromosomes of each solution vector. Then, the fitness of each vector is evaluated according to its DAS, and the domination of solution vectors is determined based on Pareto dominance.

Mating Selection: Only nondominated solution vectors are selected for mating. This will result in a varying mating pool size and consequently a varying offspring size for each generation, leading to high selection pressure.

Variation Operators: After filling the mating pool, two parents are selected at random from the mating pool without replacement. Then, for each parent, an allele representing each locus of a chromosome is randomly selected from all available alleles at this locus, creating two sets of alleles (one per parent). Any recombination operator may be applied to those two sets to create the child's DAS; here, the SBX operator [69] is used. The child's DAS is then mutated; any mutation operator may be used, and polynomial mutation [69] is used here. After the child's DAS is created, the remaining chromosomes of the child's solution vector are selected and copied at random from both parents' chromosomes. All chromosomes in both parents have the same probability of being selected and copied, carrying their genes to later generations. No mutation is applied to these redundant chromosomes.

Survival Selection: After evaluating the fitness of the offspring, parents and offspring fight for survival, as Pareto dominance is applied to the combined population of parents and offspring. The least dominated solution vectors (according to the number of solutions dominating them) survive to form the population of the next generation. Ties are resolved at random.
Figure 3.8 shows the structure of solution vectors for the Polyploid algorithm. The solution vector parent1 contains the DAS chromosome (the seventh row) plus six more redundant chromosomes (rows 1–6), so this is a 7-ploid representation. Each of those seven chromosomes contains five alleles. The DAS chromosome of the parent1 vector is [a7 b5 c1 d4 e2], where a7 is the first allele, b5 the second, and c1, d4 and e2 the third, fourth and fifth alleles.
After parent1 and parent2 are selected for mating, an allele is selected for each of them to represent each locus (column) of the solution vector, as shown in Figure 3.8. For parent1, a2 is selected to represent the first locus, b6 the second, and c1, d6 and e1 the third, fourth and fifth loci. Hence, parent1 is represented by the chromosome [a2 b6 c1 d6 e1]. Following the same procedure, the chromosome [a1 b3 c8 d6 e5] is produced for parent2. Next, partial dominance is applied to these two chromosomes to get the DAS chromosome of the child solution vector: SBX [69] is used to simulate partial dominance, and polynomial mutation is then applied to the produced DAS chromosome. After the child's DAS chromosome is created and mutated, the remaining 6 chromosomes of the child's solution vector are filled by copying 6 chromosomes at random from both parents, completing the mating process.
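The mating step above can be sketched as follows, assuming real-valued alleles in [0, 1] and each parent stored as a list of chromosomes with the DAS first. The helper names are illustrative, not from the thesis, and mutation of the child's DAS is omitted for brevity:

```python
import random

def sbx_gene(p1, p2, eta=15.0, low=0.0, high=1.0):
    """Simulated Binary Crossover (SBX) for one gene pair; returns one child value."""
    u = random.random()
    if u <= 0.5:
        beta = (2.0 * u) ** (1.0 / (eta + 1.0))
    else:
        beta = (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta + 1.0))
    child = 0.5 * ((1.0 + beta) * p1 + (1.0 - beta) * p2)
    return min(max(child, low), high)

def representative(parent):
    """Pick one allele per locus at random from all chromosomes of a parent."""
    n_loci = len(parent[0])
    return [random.choice([chrom[i] for chrom in parent]) for i in range(n_loci)]

def mate(parent1, parent2):
    """parent: list of d chromosomes (DAS first), each a list of alleles."""
    r1, r2 = representative(parent1), representative(parent2)
    # Partial dominance simulated by SBX to create the child's DAS.
    child_das = [sbx_gene(a, b) for a, b in zip(r1, r2)]
    # Remaining d - 1 chromosomes copied at random from the parents' pool,
    # all chromosomes having the same selection probability.
    pool = parent1 + parent2
    redundant = [list(random.choice(pool)) for _ in range(len(parent1) - 1)]
    return [child_das] + redundant
```

The child has the same ploidy as its parents: one freshly recombined DAS plus d − 1 redundant chromosomes copied verbatim, which is what lets recessive genetic material survive into later generations.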
3.4.3 Experiments
In the following experiments, the convergence and diversity of solution vectors are measured using two running metrics to understand the behavior of the algorithms. The average orthogonal distance of solution vectors to the PF is used to measure convergence, because the equations of the global front are known in advance, while the diversity metric-2 presented in [84] is modified and used here to measure the diversity of solutions. The modified diversity metric-2 works as follows:

The known PF is uniformly divided into equal cells, and these cells are projected onto the current objective dimension; the obtained solutions are projected on this dimension as well.

For every projected cell that contains one or more projected solutions, an occupation value of 1 is assigned; an occupation value of 0 is assigned to the projected cell otherwise.

A diversity value is assigned to each non-boundary projected cell using a sliding window over the cells (i − 1, i, i + 1), according to the values shown in Table 3.4, where i is the index of the cell being scored. Diversity values for boundary cells are assigned according to Table 3.4 as well, using the window (1, 2) for the left boundary cell and (n − 1, n) for the right boundary cell, where n is the number of projected cells.

The cells' diversity values are added and divided by the number of cells to give the current objective's diversity value in the range [0, 1].

Previous steps are repeated for the remaining objectives.

The average diversity value of all objectives is the overall diversity value of the population. The best population distribution yields a diversity value of 1, while this value approaches 0 for the worst distribution.
Table 3.4 (non-boundary cells): diversity values for the sliding window.
cell(i−1)  cell(i)  cell(i+1)  diversity value
0  0  0  0
0  0  1  0.5
0  1  0  0.75
0  1  1  0.67
1  0  0  0.5
1  0  1  0.75
1  1  0  0.67
1  1  1  1
Table 3.4 (boundary cells): diversity values for the two-cell window.
cell(i)  cell(i+1)  diversity value
0  0  0
0  1  0.67
1  0  0.67
1  1  1
Figure 3.9 compares diversity evaluation for the modified and unmodified diversity metric-2 for a population of size 6 on a 2-dimensional DTLZ2 [85] minimization problem. To evaluate the diversity value for the first objective using the modified diversity metric-2, the obtained PF is uniformly divided into 6 equal cells, such as cells a, b and c. These equal cells are projected onto the first objective axis, giving 6 unequal projected cells. The obtained solutions are projected onto this axis as well, and each projected cell containing one or more projected solutions is assigned an occupation value of 1 (occupied projected cell); an occupation value of 0 is assigned otherwise (empty projected cell). Figure 3.9 shows how the modified diversity metric-2 was able to detect the two gaps at cells a and c on the PF and detected all the other occupied cells on the front. The occupation values vector produced is [1 0 1 1 0 1]. Using a sliding window of size 3 with the values presented in Table 3.4, a diversity value is assigned to each non-boundary projected cell, while the boundary projected cells are assigned diversity values according to the boundary part of Table 3.4. The diversity values vector produced is [0.67 0.75 0.67 0.67 0.75 0.67]. These 6 values are summed and divided by 6 (the maximum possible sum of the 6 diversity values, resulting from an ideal solution distribution) to give a normalized diversity value of 4.18/6 ≈ 0.697 for this objective. The same procedure is used to evaluate the diversity value for the second objective, and the average of the two values is the overall diversity value of the population.
Figure 3.9 also shows how the unmodified diversity metric-2 was unable to detect the gap at cell a in the obtained PF when the solutions are projected onto the first objective axis, and falsely detected a gap close to the boundary. Furthermore, the unmodified diversity metric-2 was unable to detect the gaps at cells a and c when the solutions are projected onto the second objective axis, and again falsely detected a gap close to the boundary. This shortcoming of the unmodified version is due to its poor cell partitioning: it ignores the shape of the PF, so regions with a higher slope are under-represented in cells and gaps are overlooked there, while regions with a lower slope are over-represented, leading to false gap detection. The modification presented here exploits the knowledge of the PF in benchmark problems: the size of each projected cell is inversely proportional to the slope of its corresponding PF cell.
The mapping of occupation values to cell diversity values is shown in Table 3.4 for non-boundary cells. This mapping is made so that the occupation vector [0 0 0] gets the lowest diversity value (0) because it reflects an empty region (no solutions in three consecutive cells), while the occupation vectors [1 0 0] and [0 0 1] get a higher diversity value (0.5) because only one occupied cell appears, at the edge (non-uniform distribution). The occupation vectors [1 1 0] and [0 1 1] get a diversity value of 0.67 because they have two occupied cells, albeit not uniformly distributed, while the occupation vectors [1 0 1] and [0 1 0] get a higher diversity value (0.75) due to the uniform distribution of their occupied cells. The best diversity value (1) is assigned to the occupation vector [1 1 1] because all of its cells are occupied. A similar mapping is shown in Table 3.4 for boundary cells: the occupation vectors [0 0] and [1 1] get the lowest (0) and highest (1) diversity values respectively, while [1 0] and [0 1] get an in-between diversity value (0.67).
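The sliding-window scoring of Table 3.4 can be written directly as lookup tables. The sketch below computes the diversity value for one objective from its occupation vector; the function and table names are illustrative:

```python
# Window values from Table 3.4: non-boundary (3-cell) and boundary (2-cell).
NON_BOUNDARY = {(0, 0, 0): 0.0, (0, 0, 1): 0.5, (0, 1, 0): 0.75, (0, 1, 1): 0.67,
                (1, 0, 0): 0.5, (1, 0, 1): 0.75, (1, 1, 0): 0.67, (1, 1, 1): 1.0}
BOUNDARY = {(0, 0): 0.0, (0, 1): 0.67, (1, 0): 0.67, (1, 1): 1.0}

def objective_diversity(occupation):
    """Diversity value in [0, 1] for one objective's occupation vector."""
    n = len(occupation)
    values = []
    for i in range(n):
        if i == 0:                       # left boundary cell
            values.append(BOUNDARY[(occupation[0], occupation[1])])
        elif i == n - 1:                 # right boundary cell
            values.append(BOUNDARY[(occupation[-2], occupation[-1])])
        else:                            # sliding window (i-1, i, i+1)
            values.append(NON_BOUNDARY[tuple(occupation[i - 1:i + 2])])
    return sum(values) / n
```

For the occupation vector [1 0 1 1 0 1] of the worked example, this returns 4.18/6 ≈ 0.697, matching the per-objective value computed above.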
In the following experiments, some problems from the DTLZ benchmark set [85] are used to investigate the effect of polyploidy on convergence and diversity. The first test problem (DTLZ1) has a huge number of local fronts but a simple linear PF, while the second test problem (DTLZ2) has a nonlinear (spherical) PF. The third test problem (DTLZ3) combines the difficulties of DTLZ1 and DTLZ2. These three test problems test the ability of the algorithm to escape local optima and converge to the global front with varying degrees of problem difficulty. The fourth test problem (DTLZ4) has a biased distribution of solutions along the PF; this problem tests the ability of the algorithm to maintain a good distribution along the front. Using this varied set of test problems, a fair analysis of the algorithms' behavior regarding convergence and diversity is conducted.
The Nondominated Sorting Genetic Algorithm II (NSGA-II) and the Polyploid algorithm differ mainly in two aspects. The first is the absence of an explicit diversity maintenance mechanism in the Polyploid algorithm, unlike NSGA-II, which has a computationally expensive diversity maintenance mechanism. The second is the redundant chromosomes in the Polyploid algorithm, which have no counterpart in NSGA-II. The following experiments test whether the redundant chromosomes in the Polyploid algorithm compensate for the absence of an explicit diversity maintenance mechanism and help the algorithm maintain good diversity. Moreover, the effect of these redundant chromosomes on the speed of convergence to the PF is investigated.
All the test problems were run with 3, 4, 6, and 10 objectives and with a fixed population size. The following parameter values were found empirically to produce the best results for NSGA-II and the Polyploid algorithm on the test problems used: for all test problems, SBX is used for recombination and polynomial mutation is used for mutation, with the mutation probability set in terms of n, the number of decision variables used.
3.4.4 DTLZ1
The first test problem to be used is DTLZ1 shown in (3.26) and (3.27).
Minimize
f_1(x) = ½ x_1 x_2 ⋯ x_{M−1} (1 + g(x_M)),
f_2(x) = ½ x_1 x_2 ⋯ (1 − x_{M−1}) (1 + g(x_M)),
⋮
f_M(x) = ½ (1 − x_1) (1 + g(x_M)),
subject to 0 ≤ x_i ≤ 1, for i = 1, 2, …, n,   (3.26)

where
g(x_M) = 100 [ |x_M| + Σ_{x_i ∈ x_M} ((x_i − 0.5)² − cos(20π(x_i − 0.5))) ].   (3.27)
Where x and x_i are a decision vector and a decision variable, respectively. The g(x_M) function requires k variables (x_M denotes the last k variables of x), and the total number of variables is n = M + k − 1, where M is the number of objectives.
This problem has a simple linear PF but a huge number of local optima: there exist (11^k − 1) local fronts. The large number of decision variables creates a huge number of local optima to test the ability of the algorithm to escape them. The Pareto-optimal solutions for this problem correspond to x_i = 0.5 for all x_i ∈ x_M, while the objective function values lie on the linear hyperplane Σ_{m=1}^{M} f_m = 0.5. The algorithms were run for 50,000 function evaluations to analyze their performance in the long run.
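A direct implementation of (3.26)-(3.27) serves as a sanity check that Pareto-optimal points (x_i = 0.5 for all x_i ∈ x_M) give g = 0 and objective values summing to 0.5. The function name and argument layout follow common DTLZ conventions and are illustrative:

```python
import math

def dtlz1(x, M):
    """Evaluate DTLZ1 for decision vector x (length n = M + k - 1), M objectives."""
    x_m = x[M - 1:]                                    # the k variables feeding g
    g = 100.0 * (len(x_m) + sum((xi - 0.5) ** 2
                                - math.cos(20.0 * math.pi * (xi - 0.5))
                                for xi in x_m))
    f = []
    for m in range(M):
        val = 0.5 * (1.0 + g)
        for xi in x[:M - 1 - m]:                       # product of leading variables
            val *= xi
        if m > 0:                                      # one (1 - x) factor for m >= 2
            val *= 1.0 - x[M - 1 - m]
        f.append(val)
    return f
```

At a Pareto-optimal point the objective values add up to 0.5 regardless of the first M − 1 variables, which is the linear hyperplane mentioned above.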
As shown in Figure 3.10, NSGA-II has the best convergence speed in the early evaluations, but it is overtaken by the Polyploid algorithms one after another, except for the 10-ploids algorithm in the 3-objective case, which does not catch it within the 50,000 function evaluations (Figure 3.10(a)). However, as the number of objectives increases, the Polyploid algorithms catch NSGA-II earlier in the run. In the 3-objective case, the 7-ploids algorithm catches NSGA-II after around 35,000 evaluations, while the 10-ploids algorithm is unable to catch it (Figure 3.10(a)). In the 4-objective case, however, the 7-ploids algorithm overtakes NSGA-II after around 23,000 evaluations, and the 10-ploids algorithm does so after 33,000 evaluations (Figure 3.10(b)). This reveals the ability of the Polyploid algorithms to handle many objectives in rugged objective landscapes, an ability that is magnified as the problem gets more difficult.
Table 3.5 shows the diversity of the algorithms after 50,000 function evaluations. NSGA-II has the highest diversity values in the 4, 6 and 10-objective cases, while the 7-ploids algorithm outperforms the other algorithms in the 3-objective problem. The diversity of the Polyploid algorithms increases slightly with the number of objectives: a diversity value of 0.65 for the 2-ploids algorithm with 3 objectives reaches 0.79 in the 10-objective case.
Table 3.6 shows the convergence of the algorithms after 50,000 function evaluations. The 2-ploids algorithm outperforms all the other algorithms for 3, 4, and 6 objectives, while in the 10-objective case the 4-ploids algorithm is the best, achieving a distance of 382 compared to 389 for the 2-ploids algorithm, which comes second.
Comparing the Polyploid algorithms regarding the two performance criteria (convergence speed and diversity maintenance): increasing the ploidy number slightly increases the diversity of most Polyploid algorithms (maximum increase = 0.0618), while a higher ploidy number slows down the convergence speed.
Table 3.5: Diversity values after 50,000 function evaluations on DTLZ1.
                          objectives
algorithm    measure      3       4       6       10
2-ploids     Average      0.6506  0.6863  0.7067  0.7942
             Std. Dev.    0.0002  0.0003  0.0002  0.0003
4-ploids     Average      0.7065  0.7075  0.7043  0.7907
             Std. Dev.    0.0004  0.0001  0.0005  0.0001
7-ploids     Average      0.7124  0.7141  0.7480  0.8036
             Std. Dev.    0.0002  0.0002  0.0002  0.0003
10-ploids    Average      0.6993  0.6953  0.7201  0.8198
             Std. Dev.    0.0003  0.0001  0.0003  0.0001
NSGA-II      Average      0.6916  0.7945  0.9312  0.8722
             Std. Dev.    0.0914  0.0312  0.0195  0.0099
Table 3.6: Average distance to the PF after 50,000 function evaluations on DTLZ1.
                          objectives
algorithm    measure      3       4       6       10
2-ploids     Average      30.370  78.380  351.08  389.06
             Std. Dev.    0.0785  0.0782  0.2550  0.3853
4-ploids     Average      44.550  95.390  418.80  382.07
             Std. Dev.    0.1756  0.0828  0.3139  0.1801
7-ploids     Average      127.96  129.97  447.77  426.67
             Std. Dev.    0.1571  0.1657  0.1966  0.1564
10-ploids    Average      201.45  167.09  466.41  414.77
             Std. Dev.    0.3229  0.0905  0.2295  0.0647
NSGA-II      Average      157.00  219.09  442.94  485.17
             Std. Dev.    12.090  9.4720  25.340  3.6930
3.4.5 DTLZ2
The second benchmark problem to be used is DTLZ2, which is shown in (3.28) and (3.29).
Minimize
f_1(x) = (1 + g(x_M)) cos(x_1 π/2) cos(x_2 π/2) ⋯ cos(x_{M−1} π/2),
f_2(x) = (1 + g(x_M)) cos(x_1 π/2) ⋯ cos(x_{M−2} π/2) sin(x_{M−1} π/2),
⋮
f_M(x) = (1 + g(x_M)) sin(x_1 π/2),
subject to 0 ≤ x_i ≤ 1, for i = 1, 2, …, n,   (3.28)

where
g(x_M) = Σ_{x_i ∈ x_M} (x_i − 0.5)².   (3.29)
Where x and x_i are a decision vector and a decision variable, respectively. The g(x_M) function requires k variables, and the total number of variables is n = M + k − 1, where M is the number of objectives.
The PF of this problem corresponds to x_i = 0.5 for all x_i ∈ x_M, which leads to g(x_M) = 0 and Σ_{m=1}^{M} f_m² = 1. For this problem, the number of decision variables is set to n = M + k − 1.
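Analogously to DTLZ1, a direct implementation of (3.28)-(3.29) confirms that Pareto-optimal points lie on the unit hypersphere (the squared objective values sum to 1); naming is illustrative:

```python
import math

def dtlz2(x, M):
    """Evaluate DTLZ2 for decision vector x (length n = M + k - 1), M objectives."""
    x_m = x[M - 1:]                                    # the k variables feeding g
    g = sum((xi - 0.5) ** 2 for xi in x_m)
    f = []
    for m in range(M):
        val = 1.0 + g
        for xi in x[:M - 1 - m]:                       # product of cosine terms
            val *= math.cos(xi * math.pi / 2.0)
        if m > 0:                                      # one sine factor for m >= 2
            val *= math.sin(x[M - 1 - m] * math.pi / 2.0)
        f.append(val)
    return f
```

The spherical identity Σ f_m² = (1 + g)² holds for any x, so it equals 1 exactly when g = 0, i.e. on the PF.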
Figure 3.11 shows the distance to the PF achieved by the algorithms against function evaluations. As in the case of DTLZ1, the 2-ploids algorithm is the best performer, converging faster than the other algorithms. However, the performance of the optimizers gets closer as the number of objectives increases. In the 10-objective case (Figure 3.11(c)), the 10-ploids algorithm overtakes the 7-ploids algorithm after 10,000 function evaluations. Figures 3.11(a)-(c) show how the performance of NSGA-II deteriorates as the number of objectives increases, to the point that in the 10-objective case the algorithm diverges (Figure 3.11(c)).
Table 3.7 shows the diversity of the different algorithms. Although NSGA-II achieved the best diversity values in the 3, 4, and 6-objective problems, its performance declines as the number of objectives increases. The performance of the Polyploid algorithms, on the other hand, is either steady or improving with the number of objectives. In the 3-objective case, NSGA-II has a superior diversity value of 0.799, while the 7-ploids algorithm achieves a lower diversity value of 0.6452; but for the 10-objective problem, the 7-ploids algorithm overtakes NSGA-II, achieving 0.7398 versus 0.7194.
Table 3.7: Diversity values of the algorithms on DTLZ2.
                          objectives
algorithm    measure      3       4       6       10
2-ploids     Average      0.6621  0.6594  0.6263  0.6872
             Std. Dev.    0.0208  0.0744  0.0543  0.0721
4-ploids     Average      0.6622  0.6539  0.6610  0.7198
             Std. Dev.    0.0114  0.0212  0.0416  0.0492
7-ploids     Average      0.6452  0.6807  0.6773  0.7398
             Std. Dev.    0.0432  0.0229  0.0175  0.0115
10-ploids    Average      0.5865  0.6459  0.6881  0.7153
             Std. Dev.    0.0449  0.0229  0.0232  0.0338
NSGA-II      Average      0.7997  0.7750  0.7479  0.7194
             Std. Dev.    0.0148  0.0159  0.0109  0.0082
As shown in the obtained PF plots, the 2-ploids algorithm converged well to the PF of the 3-objective problem, while NSGA-II stood at a relatively larger distance from that front.
The relative performance of the Polyploid algorithms on the DTLZ2 test problem is similar to that on DTLZ1. Regarding diversity, increasing the ploidy number only slightly affects the diversity of the Polyploid algorithms: the maximum change is a decrease of 0.0756 in the diversity of the 2-ploids algorithm, while the other Polyploid algorithms maintained an almost constant diversity value. Regarding convergence, increasing the ploidy number always slows down convergence, except for the 7- and 10-ploids algorithms in the 10-objective problem, as pointed out earlier.
To analyze the performance of the redundant chromosomes, the following experiments are conducted. A new population is created by extracting all the chromosomes of each solution vector in the Polyploid algorithms. This new population has a size of N × d, where N is the original population size and d is the ploidy number. Then the average distance of the new population to the PF is calculated, and the percentage of dominated solutions in the new population is evaluated. For each of the Polyploid algorithms in Table 3.8, the first row shows the average distance of the original population to the PF, the second row shows the average distance of the new population to the PF, and the third row shows the percentage of dominated solutions in the new population (using Pareto dominance).
Table 3.8: Distance to the PF of the original and new (extracted) populations on DTLZ2, and the percentage of dominated solutions in the new population.
                              objectives
algorithm    distance to front   3       4       6       10
2-ploids     original pop.       0.0515  0.1139  0.4381  0.9738
             new pop.            0.0518  0.1145  0.4428  0.9929
             % dominated         7.52    6.25    4.89    2.11
4-ploids     original pop.       0.0871  0.1469  0.7348  1.3115
             new pop.            0.0890  0.1499  0.7528  1.3463
             % dominated         19.64   13.57   10.31   3.105
7-ploids     original pop.       0.1402  0.2692  0.9319  1.6890
             new pop.            0.1484  0.2785  0.9919  1.7293
             % dominated         40.17   20.55   13.43   3.26
10-ploids    original pop.       0.3402  0.3358  1.2231  1.4619
             new pop.            0.5006  0.3685  1.3160  1.4983
             % dominated         70.36   36.67   18.52   5.08
As shown in Table 3.8, the average distance of the new population is slightly worse than that of the original population. In the case of the 2-ploids algorithm with 3 objectives, the average distance is 0.0515 for the original population and 0.0518 for the new population; but in the case of the 10-ploids algorithm with 3 objectives, the average distance deteriorates from 0.34 to 0.5. Note that the percentage of dominated solutions is low for the 2-ploids algorithm, with a maximum value of 7.52% in the 3-objective case. This value steadily increases with the ploidy number for all objective counts, reaching 70.36% dominated solutions for the 10-ploids algorithm with 3 objectives.
The new population, which offers more solutions, may be used instead of the original population of the 2-ploids algorithm, giving the decision maker more choices. It is especially useful in problems with a high cost of function evaluations, as the extra solutions are produced without any extra computational cost.
3.4.6 DTLZ3
The DTLZ3 problem combines the objective functions of DTLZ2 with the multimodal g function of DTLZ1. It is given as follows:

Minimize
f_1(x) = (1 + g(x_M)) cos(x_1 π/2) ⋯ cos(x_{M−1} π/2),
⋮
f_M(x) = (1 + g(x_M)) sin(x_1 π/2),
subject to 0 ≤ x_i ≤ 1, with g(x_M) as defined in (3.27).   (3.30)