1 Introduction
Genetic algorithms are a popular optimization technique that has been used in domains ranging from consumer goods quality assurance (Lee et al., 2016) and traffic flow modeling (Chiappone et al., 2016) to medical imaging (Pareek and Patidar, 2016) and materials science (Davis et al., 2016). Additionally, genetic algorithms are often used to optimize the parameters or structure of other machine learning algorithms (Perreault et al., 2015). A significant amount of work has been done on using genetic algorithms to test software (Jones et al., 1996; Wegener et al., 1997; Rao and Govindarajulu, 2015), especially in mutation testing. However, there is nothing in the literature about testing genetic algorithms themselves, even though they are often used for critical applications such as those mentioned above. This is especially problematic given that genetic algorithms are being used as a tool to support medical diagnosis (Tan et al., 2003; Peña-Reyes and Sipper, 1999) and in other high-stakes situations.
There has been little work on testing machine learning algorithms in general (Xie et al., 2009). There are several reasons why so little work has been done in this area. The first is that most machine learning algorithms incorporate at least one random element. Random elements are challenging to test because the same input can produce different results (Guderlei and Mayer, 2007).
Furthermore, the correct answer is often unknown. For example, a person can design a genetic algorithm to optimize land zoning subject to multiple constraints (Stewart et al., 2004). In that instance, if the true answer were known, the land would already have been zoned. Even when the true answer is known, as it is for many of the optimization problems we discuss here, the answer a genetic algorithm returns may not match it, even if the genetic algorithm is correct. For example, for Ackley's function, which we discuss later in this paper, a genetic algorithm may not find the true global minimum at (0,…,0) for a given number of dimensions. Instead, the genetic algorithm may find one of the many local minima of Ackley's function and return that value. Although this local minimum is not the true answer, the genetic algorithm may still be correct.
Genetic algorithms are useful tools for approximation. They tend to return an approximation of a good answer rather than an exact answer. If a genetic algorithm returned an exact and correct answer for a problem in which the answer was known, especially if it returned the same answer several times, we would probably suspect the genetic algorithm of having a coding error.
To test genetic algorithms, we can use a testing technique called metamorphic testing, first proposed by Chen et al. (Chen and Yiu, 1998). Metamorphic testing involves defining properties that relate two or more outputs of an algorithm based on their inputs. If the outputs of two related inputs do not satisfy the property, there is an error in the program under test. Statistical metamorphic testing (Guderlei and Mayer, 2007) is a useful technique when the program under test has random elements. Instead of an initial and follow-up test case, statistical metamorphic testing uses an initial and follow-up sample, which are then compared using statistical hypothesis testing.
We identified 17 metamorphic relations from the literature on genetic algorithms. We then used mutation testing to demonstrate the effectiveness of the various relations on two different implementations. We also demonstrated the effectiveness of three of these relations on two different differential evolution implementations. We found that traditional deterministic unit tests were not as effective at detecting mutants as metamorphic relations, and that system-level relations and unit-level relations perform well on different parts of the genetic algorithm. We also found that our three relations for differential evolution performed as well as or better than the twelve relations for genetic algorithms.
The rest of this paper is organized as follows. Section 2 contains related work and background on metamorphic testing, genetic algorithms, and mutation testing. In Section 3, we discuss our implementation of a genetic algorithm. In Section 4, we identify 17 metamorphic relations: five for fitness functions, three for genetic algorithm operators, and nine system-level relations. In Section 5, we lay out several experiments we conducted. Finally, Section 6 contains our conclusions and future work.
2 Background and Related Work
Genetic algorithms are often used in software testing. The most common uses of genetic algorithms are in test case generation (Geronimo et al., 2012), multi-objective test case generation (Henard et al., 2013), test case prioritization (Sharma et al., ????), and mutation testing (Rao and Govindarajulu, 2015). There has been no work published in the literature on how to test genetic algorithms. However, Arcuri and Briand (Arcuri and Briand, 2014) state that many randomized algorithms, such as genetic algorithms, are used in software applications. Additionally, they argue for the use of statistical testing in software applications that use randomized algorithms.
Metamorphic testing is a testing technique used when the algorithm is nondeterministic and/or when there is no way to determine the correct output (Chen and Yiu, 1998). To conduct metamorphic testing, one first defines a metamorphic relation (or a set of metamorphic relations) for the program under test. A metamorphic relation is a property that relates two outputs based on their inputs. Then, tests are defined for each metamorphic relation. Each test consists of an initial test case and a follow-up test case derived from it according to the metamorphic relation. The tests are then executed on the program under test. The outputs of the initial and follow-up test cases are evaluated to determine whether the test cases satisfy the metamorphic relation.
When the algorithm is stochastic, one cannot simply check whether the result of the follow-up test is equal to the expected value. The result may be close, but because of the stochasticity, it will generally not be exactly equal. Therefore, statistical tests are applied to determine whether the difference between the expected output and the actual output is statistically significant. This is called statistical metamorphic testing (Guderlei and Mayer, 2007). A relationship between two samples, an initial and a follow-up test sample, is specified, usually taking the form of null and alternative hypotheses. The samples are generated from the output of the program and are compared using statistical hypothesis testing.
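The idea of comparing an initial and a follow-up sample can be sketched as follows. This is only an illustration under stated assumptions: `noisy_program` is a hypothetical stochastic program, the relation "f(x) and f(-x) have the same mean" is chosen for the example, and a permutation test stands in for whichever hypothesis test a tester prefers.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(initial, followup, n_resamples=2000, rng=None):
    """Two-sided permutation test for a difference in sample means."""
    rng = rng or random.Random(0)
    observed = abs(mean(initial) - mean(followup))
    pooled = list(initial) + list(followup)
    k = len(initial)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        if abs(mean(pooled[:k]) - mean(pooled[k:])) >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_resamples + 1)


def noisy_program(x, rng):
    """Hypothetical program under test: x squared plus uniform noise."""
    return x * x + rng.random()


rng = random.Random(42)
# Initial and follow-up samples, 20 observations each.
initial = [noisy_program(3.0, rng) for _ in range(20)]
followup = [noisy_program(-3.0, rng) for _ in range(20)]
p_same = permutation_p_value(initial, followup, rng=random.Random(1))

# A deliberately faulty variant shifts the follow-up outputs by 1.
broken = [noisy_program(-3.0, rng) + 1.0 for _ in range(20)]
p_broken = permutation_p_value(initial, broken, rng=random.Random(2))
```

A small `p_broken` signals that the follow-up sample violates the relation, whereas `p_same` is compared against the chosen significance level in the usual way.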
Genetic Algorithms all consist of an encoding of potential solutions as chromosomes, e.g. a bit string, and a fitness function (Mitchell et al., 1997), which is a function that evaluates each potential solution and returns a value based on that evaluation. A generic genetic algorithm has a population of properly encoded potential solutions which is usually randomly generated, and a randomized way or ways to change potential solutions to see if the fitness of that solution improves. Unfit solutions are removed from the population.
Testing a genetic algorithm presents several problems. Genetic algorithms are randomized algorithms, so the output can differ each time the algorithm is run. There is also no way to determine the “correct” output of a genetic algorithm. In many cases, if the target output of a genetic algorithm is known, there is no reason to use a genetic algorithm at all. Furthermore, if the data provided to the genetic algorithm is misleading in some way, the output of the genetic algorithm will not match the desired output. Thus, testing approaches that require one to compare an output against a true answer will not work for testing a genetic algorithm.
In this paper, we use several problems where the correct answer is already known in order to demonstrate the process of testing a genetic algorithm. However, we also provide ways to expand the testing process in the case where the target answer is unknown.
There has been no published research on how to test genetic algorithms. However, Xie et al. (Xie et al., 2011) defined several possible metamorphic relations to use when testing a machine learning classifier. In machine learning, classification is the problem of delineating decision boundaries so that all examples inside a boundary are of one class. Genetic algorithms have been shown to be effective classifiers in a multitude of cases (Ishibuchi et al., 1997; Corcoran and Sen, 1994). Xie et al. (Xie et al., 2011) demonstrated how to test two other machine learning methods, the k-Nearest Neighbor and Naive Bayes classifiers, using metamorphic testing. These authors showed that metamorphic testing is an effective way to test machine learning classifiers. However, they only tested this approach on WEKA (Hall et al., 2009), an open source tool for performing classification, regression and other data mining tasks.
Shin Yoo (Yoo, 2010) showed how one could use metamorphic testing to validate a machine learning approach, called simulated annealing, to an optimization problem. Optimization problems consist of an objective function that we must either minimize or maximize subject to constraints. Yoo showed that metamorphic relations can be an effective way to test machine learning approaches to optimization problems, especially for certain kinds of faults. Genetic algorithms are another way to approach an optimization problem.
Murphy et al. (Murphy et al., 2009) showed how an automated metamorphic testing framework can be used to test support vector machines, decision trees, and ranking algorithms. The most complex algorithm tested by Murphy et al. is the MartiRank algorithm. This algorithm is a type of ensemble method that divides the data into a series of sublists, which it then orders according to the “quality” of the features, similar to a fitness function, except that it is iteration dependent. At each iteration, the model describes how to divide the data into lists and updates the quality measure. When all the rankings are completed, the algorithm reconstructs a final ranking based on the divisions and quality measures of previous iterations. Murphy et al. designed an automated metamorphic testing system to improve the speed at which metamorphic tests can be developed.
None of these papers test very complicated algorithms. The most complex of these is the MartiRank algorithm, which is based on a series of simple ranking algorithms. Genetic algorithms involve many more random components than even the most complex algorithm tested to date. In addition, Murphy et al. and Xie et al. used the WEKA (Hall et al., 2009) implementation of these algorithms. There is no WEKA implementation of a genetic algorithm. All of this means that there is a great need for testing genetic algorithms thoroughly, and until now there has been no established way to accomplish this task.
In order to demonstrate the effectiveness of the testing approach outlined in this paper, we will use mutation testing. Mutation testing is a technique that has been shown to be effective for comparing testing techniques (Andrews et al., 2005). The first step in mutation testing is to generate a number of mutants, given the source code of the program under test. These mutants are identical to the program under test except that one line has been changed. Next, the test set is run on each mutant. If the test set detects the changed line (in other words, if at least one test fails with the changed line where the test passed without the line being changed), the mutant is said to have been ‘killed’. If the test set does not detect the changed line, the mutant is said to have ‘survived’. Mutants that cause compilation errors, runtime errors, or test timeouts are considered ‘killed’ as well. Then, the number of killed mutants is counted and divided by the total number of mutants. This gives a mutation score. A perfect mutation score would be 1, meaning that all of the mutants were detected by the test set. However, a perfect mutation score is usually impossible. Consider the example of a for loop in a Java program.
If the code above is the original program and the following is the mutant:
there would be no way to detect the mutant as it would be functionally equivalent to the original program. However, most mutants would not be functionally equivalent and could be detected by the test set, if an appropriate test exists. Therefore, the goal of mutation testing is to get the mutation score as close to 1 as possible by changing tests so that they detect more mutants. In this paper, we compare the mutation scores of different types of tests in order to determine which tests detect the most faults.
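To make the notions of an equivalent mutant and a mutation score concrete, the following is a hypothetical Python analogue of a loop-condition mutant. The function names and the `mutation_score` helper are assumptions made for this sketch, not code from the programs under test.

```python
def sum_squares_original():
    """Original program: sum of i*i for i = 0..9."""
    total, i = 0, 0
    while i < 10:
        total += i * i
        i += 1
    return total


def sum_squares_equivalent_mutant():
    """Mutant: `i < 10` becomes `i != 10`; behavior is unchanged."""
    total, i = 0, 0
    while i != 10:
        total += i * i
        i += 1
    return total


def sum_squares_killable_mutant():
    """Mutant: `i < 10` becomes `i <= 10`; one extra iteration runs."""
    total, i = 0, 0
    while i <= 10:
        total += i * i
        i += 1
    return total


def mutation_score(mutants, tests):
    """Fraction of mutants for which at least one test fails."""
    killed = sum(1 for m in mutants if any(not t(m) for t in tests))
    return killed / len(mutants)


# A single deterministic unit test: the output must match the original.
tests = [lambda program: program() == sum_squares_original()]
score = mutation_score(
    [sum_squares_equivalent_mutant, sum_squares_killable_mutant], tests
)
```

Because `i` starts at 0 and increases by exactly 1, the first mutant can never be distinguished from the original, so no test set can raise this score above 0.5.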
3 Genetic Algorithm Implementation
Most often genetic algorithms used in classification are included as part of a larger solution. For example, a common use of genetic algorithms is to train the weights of a neural network. Genetic algorithms in an optimization context usually do not use any other algorithms. Additionally, it is trivial to turn a minimization problem into a maximization problem. This is important for the future generalization of the testing technique we develop here. Therefore we implemented a genetic algorithm that optimizes a function.
A typical example of a genetic algorithm can be seen in Algorithm 1. The generic genetic algorithm takes as input a fitness function, a population size, a fraction of the population to be replaced at each time step, a mutation rate, and a fitness threshold (Mitchell et al., 1997). The output will be a set of real numbers that produce the “best” output of the fitness function. The first step is to randomly initialize potential solutions and then evaluate the fitness of each potential solution. Then, while the termination criteria are unmet, a number of potential solutions are selected from the population, recombined to form children, and then mutated. Selection, crossover (also called recombination), and mutation are considered the operators of the genetic algorithm. Finally, each child's fitness is evaluated, and some number of individuals in the original population are replaced by children.
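The steps described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation tested in this paper: it assumes minimization, uses binary tournament selection and Gaussian mutation for brevity, and the parameter names, bounds, and `sphere` objective are hypothetical.

```python
import random


def sphere(x):
    """Hypothetical objective used only for this sketch."""
    return sum(v * v for v in x)


def mutate(child, rate, lo, hi, rng):
    """Perturb each gene with probability `rate`, staying in bounds."""
    out = []
    for g in child:
        if rng.random() < rate:
            g = min(hi, max(lo, g + rng.gauss(0, 0.1)))
        out.append(g)
    return out


def genetic_algorithm(fitness, dims, pop_size=30, replace_frac=0.5,
                      mutation_rate=0.1, threshold=1e-3,
                      max_generations=200, bounds=(-5.0, 5.0), seed=0):
    """Minimize `fitness`; returns (best_solution, best_fitness)."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Randomly initialize the population and evaluate every individual.
    population = [[rng.uniform(lo, hi) for _ in range(dims)]
                  for _ in range(pop_size)]
    scored = sorted(((fitness(ind), ind) for ind in population),
                    key=lambda p: p[0])
    n_children = max(1, int(replace_frac * pop_size))
    for _ in range(max_generations):
        if scored[0][0] <= threshold:  # termination criterion
            break
        children = []
        for _ in range(n_children):
            # Selection: binary tournament (assumption for this sketch).
            p1 = min(rng.sample(scored, 2), key=lambda p: p[0])[1]
            p2 = min(rng.sample(scored, 2), key=lambda p: p[0])[1]
            # Uniform crossover: each gene comes from either parent.
            child = [a if rng.random() < 0.5 else b
                     for a, b in zip(p1, p2)]
            child = mutate(child, mutation_rate, lo, hi, rng)
            children.append((fitness(child), child))
        # Replacement: children replace the worst individuals.
        survivors = scored[:pop_size - n_children]
        scored = sorted(survivors + children, key=lambda p: p[0])
    return scored[0][1], scored[0][0]


best_solution, best_fitness = genetic_algorithm(sphere, dims=3)
```

With the default replacement fraction, the best individual always survives into the next generation, so the best fitness never gets worse as generations are added.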
Our genetic algorithm uses uniform crossover with 2 parents and fitness proportionate selection. Fitness proportionate selection associates a probability of selection with a particular individual in the population (Hancock, 1994). If $f_i$ is the fitness of individual $i$, the probability of selection is
$$p_i = \frac{f_i}{\sum_{j=1}^{N} f_j}$$
where $N$ is the number of individuals in the population. Fitness proportionate selection tends to be relatively slow (Goldberg and Deb, 1991) and has a risk of slow convergence (Mitchell et al., 1997). Uniform crossover allows parents to contribute individual genes to a child individual rather than sequences of genes. This eliminates positional bias (Mitchell et al., 1997), which was necessary as positional bias could interfere with certain metamorphic relations. The probability that a parent will contribute a particular gene to the child is $1/n$, where $n$ represents the number of parents.
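A minimal sketch of these two operators is shown below. It assumes non-negative fitness values with a positive sum, where a larger value means a higher chance of selection; for a minimization problem the raw objective would first have to be transformed into such a value. The function names are hypothetical.

```python
import random


def selection_probabilities(fitnesses):
    """Scale each fitness by the total so the values sum to 1."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to fitness."""
    spin = rng.random() * sum(fitnesses)
    running = 0.0
    for individual, f in zip(population, fitnesses):
        running += f
        if spin < running:
            return individual
    return population[-1]  # guard against floating-point round-off


def uniform_crossover(parents, rng):
    """Copy each gene from one of the n parents, chosen uniformly."""
    return [rng.choice(genes) for genes in zip(*parents)]


rng = random.Random(0)
probs = selection_probabilities([1.0, 3.0])
picked = roulette_select(["a", "b"], [0.0, 1.0], rng)
child = uniform_crossover([(1, 2, 3, 4), (5, 6, 7, 8)], rng)
```

Since the two parents in the usage example share no values, each gene of `child` can be traced back to the parent that contributed it.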
When designing a genetic algorithm for optimization, if there is a single objective function and no constraints, creating a fitness function is very simple. We simply use the objective function itself as a measure of fitness. It is also trivial to regularize the output. In this project we will test both stochastic and deterministic objective functions.
In addition to implementing our own genetic algorithm, we contacted Strasser et al. (Strasser et al., 2016), who allowed us to use their Factored Evolutionary Algorithms framework (FEA framework) as an additional implementation to test.
3.1 Differential Evolution
Differential Evolution (DE) is another evolutionary algorithm, one that was designed for continuous spaces. DE centers around creating offspring by conducting crossover on parents and what is known as a “trial vector” (Storn and Price, 1995). A trial vector is similar to the child concept in genetic algorithms. A trial vector is generated by choosing 3 individuals from the population, without replacement (call them $x_1$, $x_2$, and $x_3$), where $x_1$, $x_2$ and $x_3$ are randomly selected. The trial vector, $v$, is then
$$v = x_1 + F\,(x_2 - x_3) \tag{1}$$
The subtraction term generates what is known as a “difference vector”, where the multiplier $F$ is some user-defined positive number. Crossover then combines a parent $x$ and $v$ into an offspring, which has the effect of “pushing” $x$ in a particular direction in the search space. There are multiple forms this crossover can take; we chose to implement binomial crossover, which is similar to uniform crossover in a genetic algorithm.
We used binomial crossover because of the representation of our chromosomes. We used a single difference vector and selected the target vector at random. As such, our differential evolution algorithm could be described as DE/rand/1/bin. We implemented differential evolution ourselves and also used the differential evolution algorithm from the FEA framework.
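One offspring-generation step of this kind of differential evolution can be sketched as follows. The sketch makes several assumptions: the names `F` (difference-vector multiplier) and `CR` (crossover rate) are chosen here for illustration, the three random individuals exclude the parent, and one gene is always taken from the trial vector, as is common in binomial crossover.

```python
import random


def de_rand_1_bin(population, parent_index, F, CR, rng):
    """Build one offspring for population[parent_index]."""
    parent = population[parent_index]
    dims = len(parent)
    others = [i for i in range(len(population)) if i != parent_index]
    # Three distinct individuals, drawn without replacement.
    a, b, c = rng.sample(others, 3)
    xa, xb, xc = population[a], population[b], population[c]
    # Trial vector: one individual plus the scaled difference vector.
    trial = [xa[j] + F * (xb[j] - xc[j]) for j in range(dims)]
    # Binomial crossover: each gene comes from the trial vector with
    # probability CR; gene j_rand always comes from the trial vector.
    j_rand = rng.randrange(dims)
    return [trial[j] if (rng.random() < CR or j == j_rand) else parent[j]
            for j in range(dims)]


rng = random.Random(0)
population = [[0.0, 0.0, 0.0]] + [[10.0, 10.0, 10.0] for _ in range(4)]
child_low_cr = de_rand_1_bin(population, 0, F=0.5, CR=0.0, rng=rng)
child_high_cr = de_rand_1_bin(population, 0, F=0.5, CR=1.0, rng=rng)
```

In the usage example every candidate individual is identical, so the trial vector is (10, 10, 10) regardless of which three are drawn, and the crossover rate alone decides how many genes the offspring inherits from it.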
3.2 Test Problems
There are a variety of established test problems used to assess the ability of a genetic algorithm to optimize. For the purposes of this project, we wanted problems that were continuous and scalable to more than three dimensions. Differential evolution operates almost exclusively in continuous spaces. Although the current genetic algorithm is able to optimize in discrete and categorical spaces, we wanted problems that would also work for the differential evolution algorithm. We need problems that are scalable to more than three dimensions because most real-world problems in machine learning have feature spaces (or numbers of dimensions) much greater than three.
We ran the genetic algorithm on a variety of problems in order to show how the metamorphic relations would change or remain constant depending on the problem. To that end, we wanted at least one random function. We also wanted at least one function with multiple local minima, because local minima have a tendency to “trap” genetic algorithms. For this project, we selected three test problems from the literature (Jamil and Yang, 2013): Ackley's function, the Quartic function, and the Rosenbrock function. Ackley's function is a continuous, deterministic, scalable function that has multiple local minima. This function fulfills our requirement for a function with multiple local minima. For an input $x = (x_1, \ldots, x_D)$ in $D$ dimensions,
$$f(x) = -20\exp\left(-0.2\sqrt{\frac{1}{D}\sum_{i=1}^{D} x_i^2}\right) - \exp\left(\frac{1}{D}\sum_{i=1}^{D}\cos(2\pi x_i)\right) + 20 + e \tag{2}$$
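The Quartic function is a continuous, scalable, stochastic function that, because of the random element, may or may not have local minima. This function fulfills our requirement for a random function. For an input $x = (x_1, \ldots, x_D)$ in $D$ dimensions, with an independent uniform random number $\mathrm{rand}[0,1)$ added for each dimension,
$$f(x) = \sum_{i=1}^{D}\left(i\,x_i^4 + \mathrm{rand}[0,1)\right) \tag{3}$$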
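The Rosenbrock function is a continuous, deterministic, scalable function with a single minimum. This function has neither a random element nor multiple local minima. This means it is a useful function for comparisons against both the Quartic function and Ackley's function. For an input $x = (x_1, \ldots, x_D)$ in $D$ dimensions,
$$f(x) = \sum_{i=1}^{D-1}\left[100\left(x_{i+1} - x_i^2\right)^2 + \left(x_i - 1\right)^2\right] \tag{4}$$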
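The three test problems can be sketched in Python as follows. The sketch assumes the standard benchmark forms of Ackley's function and the Rosenbrock function, and a Quartic function that adds one uniform random number per dimension, as described in Section 4.1; the function names and the optional `rng` argument are choices made for this example.

```python
import math
import random


def ackley(x):
    """Ackley's function; global minimum 0 at the origin."""
    d = len(x)
    mean_sq = sum(v * v for v in x) / d
    mean_cos = sum(math.cos(2 * math.pi * v) for v in x) / d
    return (-20 * math.exp(-0.2 * math.sqrt(mean_sq))
            - math.exp(mean_cos) + 20 + math.e)


def quartic(x, rng=random):
    """Quartic function with one uniform [0, 1) draw per dimension."""
    return sum(i * v ** 4 + rng.random()
               for i, v in enumerate(x, start=1))


def rosenbrock(x):
    """Rosenbrock function; minimum 0 at (1, 1, ..., 1)."""
    return sum(100 * (x[i + 1] - x[i] ** 2) ** 2 + (x[i] - 1) ** 2
               for i in range(len(x) - 1))


ackley_at_origin = ackley([0.0, 0.0, 0.0])
rosenbrock_at_ones = rosenbrock([1.0, 1.0, 1.0, 1.0])
quartic_at_origin = quartic([0.0, 0.0, 0.0, 0.0], random.Random(0))
```

The two deterministic functions can be checked against their known minima with ordinary unit tests, while the Quartic output at the origin can only be bounded, here between 0 and 4.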
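| Class | LOC | # of tests | # of traditional tests | # of MRs |
|---|---|---|---|---|
| Chromosome | 22 | 3 | 2 | 1 |
| Fitness Function | 14 | 1 | 0 | 1 |
| Ackleys | 17 | 7 | 5 | 2 |
| Quartic | 11 | 6 | 3 | 3 |
| Rosenbrock | 18 | 6 | 5 | 1 |
| Genetic Algorithm | 112 | 14 | 5 | 9 |
| Differential Evolution | 82 | 5 | 2 | 3 |
| Total | 271 | 42 | 22 | 20 |
| Total without DE | 189 | 37 | 20 | 17 |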
The genetic algorithm we implemented is organized as follows. We used three fitness functions in this work: Ackley's function, the Quartic function, and the Rosenbrock function. The Chromosome class encapsulates a potential solution to a particular problem. The Genetic Algorithm class is the biggest class by far. It contains a list of Chromosomes, which represents the population. It also contains all the operators for the genetic algorithm, a multitude of getters and setters, and the genetic algorithm itself, modeled on the genetic algorithm we outlined in Algorithm 1.
The Differential Evolution class is similar to the Genetic Algorithm class. For most analyses, the Differential Evolution class was not included.
The FEA framework is a much bigger program, containing over 5000 lines of code. For comparison, our implementation used only 271 lines of code. However, much of the FEA framework code implements other algorithms, such as Particle Swarm Optimization, or other test problems, such as the Rastrigin function. The lines of code directly relating to the genetic algorithm, the differential evolution algorithm, or the three fitness functions amount to only 245 lines of code. The FEA framework can be loosely organized into the genetic algorithm, the fitness functions, and the differential evolution algorithm.
4 Metamorphic Relations
In this section we lay out 17 metamorphic relations. Three of these relations also applied to differential evolution.
4.1 Metamorphic Relations for Fitness Functions
The fitness functions are very different from the rest of the genetic algorithm because we do have the true answers for two of the functions. For Ackley's function, the minimum output is 0, which occurs when the input is (0,0,…,0) in any number of dimensions. The maximum value of Ackley's function is approximately 22.3. Since there are many peaks in Ackley's function, there are many different inputs that could reach this value. In two dimensions, one of those inputs is (21.6, 31.5). For the Rosenbrock function, the minimum output is 0. This occurs when the input is (1,1,…,1) in any number of dimensions. The maximum output is approximately . These values were tested using deterministic unit tests.
Relation 1.1
The Quartic function was unit-tested using metamorphic relations. Since the Quartic function adds a random number for each dimension, we do not know what the exact value of the Quartic function will be. However, we do know that the minimum value without the random numbers would be 0. We also know that the random numbers will all be greater than or equal to 0 and strictly less than 1. Therefore, we know that the minimum value for the Quartic function will lie in the range $[0, D)$, where $D$ is the number of dimensions. In certain circumstances, other inputs will cause the output to be in this range. The follow-up test input consists of $D-1$ 0's followed by a last input value of 1. Because the last term of the Quartic function is weighted by $D$, this input always produces a result of at least $D$, which is larger than any result for the all-0's input. As an example, in 4 dimensions, the initial test case would be (0,0,0,0) and the follow-up test case would be (0,0,0,1). With these inputs, the output from the initial test case would always be less than the output from the follow-up test case.
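Relation 1.1 can be sketched as a repeated check, shown below. The sketch assumes a Quartic function of the form "sum over dimensions of i times x_i to the fourth power, plus one uniform [0, 1) draw per dimension"; the function names and the number of trials are hypothetical.

```python
import random


def quartic(x, rng):
    """Quartic function with one uniform [0, 1) draw per dimension."""
    return sum(i * v ** 4 + rng.random()
               for i, v in enumerate(x, start=1))


def relation_1_1_holds(dims, trials=100, seed=0):
    """Check that f(0,...,0) < f(0,...,0,1) on every trial."""
    rng = random.Random(seed)
    initial_input = [0.0] * dims
    followup_input = [0.0] * (dims - 1) + [1.0]
    for _ in range(trials):
        initial = quartic(initial_input, rng)
        followup = quartic(followup_input, rng)
        # The initial output must lie in [0, dims) and be smaller
        # than the follow-up output.
        if not (0.0 <= initial < dims and initial < followup):
            return False
    return True


holds_in_4_dimensions = relation_1_1_holds(4)
```

No statistical test is needed for this relation, because the ranges of the two outputs do not overlap.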
Relation 1.2
The maximum value the Quartic function would take if the random elements were not added is $1.28^4 \sum_{i=1}^{D} i = 1.28^4\,D(D+1)/2$, where $D$ is the number of dimensions. This happens at (1.28, 1.28, …, 1.28) or (-1.28, -1.28, …, -1.28). The random number generator includes 0, so the maximum value is always greater than or equal to this number. Since there are $D$ random numbers added, and the random numbers are pulled from a uniform distribution on $[0, 1)$, the mean maximum value will be $1.28^4\,D(D+1)/2 + D/2$ and the variance is $D/12$. For this test, since we have fairly complete knowledge of the distribution, we could have used simple statistical tests. However, we used statistical metamorphic testing (Guderlei and Mayer, 2007) because it is not always possible to know the distribution of the fitness function. To perform the statistical metamorphic tests, we generated two samples, each with 20 observations, by running the Quartic function on the input (1.28, 1.28, …, 1.28) 20 times for each sample (e.g. (1.28, 1.28) for 2 dimensions) and recording the output of the function. Our null hypothesis was that the mean of the two samples would not be equal to each other. Our alternative hypothesis was that the means would be equal to each other.
Relation 1.3
One of the metamorphic relations often mentioned is changing the order of the attributes (Xie et al., 2011). For Ackley's function, this is a valid metamorphic relation that we use in the testing process. As an example, an initial input of (6.4, 2.5, 1.25) and a follow-up input of (1.25, 2.5, 6.4) will both produce an output of 13.24197384. However, for the Quartic function and the Rosenbrock function, this relation is invalid. Consider the following example for the Quartic function in 3 dimensions. The first input is (0.25, 0.5, 1.28). This produces a result of 9.94196993. The second input is (1.28, 0.5, 0.25). Only the order changes between the two inputs. However, the output for the second input is 4.64107331, less than half the output of the first example.
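The permutation relation can be sketched as follows. The sketch assumes the standard form of Ackley's function and, to keep the comparison deterministic, uses only the noise-free part of the Quartic function; both function names are hypothetical.

```python
import math
from itertools import permutations


def ackley(x):
    """Ackley's function in its standard form."""
    d = len(x)
    mean_sq = sum(v * v for v in x) / d
    mean_cos = sum(math.cos(2 * math.pi * v) for v in x) / d
    return (-20 * math.exp(-0.2 * math.sqrt(mean_sq))
            - math.exp(mean_cos) + 20 + math.e)


def quartic_noise_free(x):
    """Deterministic part of the Quartic function (no random draws)."""
    return sum(i * v ** 4 for i, v in enumerate(x, start=1))


# Ackley's function: every ordering of the input gives the same output,
# up to floating-point round-off.
reference = ackley((6.4, 2.5, 1.25))
ackley_is_symmetric = all(
    math.isclose(ackley(p), reference, rel_tol=1e-9)
    for p in permutations((6.4, 2.5, 1.25))
)

# Quartic function: the index weight i makes the output order dependent.
first = quartic_noise_free((0.25, 0.5, 1.28))
second = quartic_noise_free((1.28, 0.5, 0.25))
```

The noise-free Quartic values are roughly 8.18 and 2.82, which is consistent with the outputs quoted above once the three random numbers are added.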
Relation 1.4
Another metamorphic relation for the Quartic and Rosenbrock functions is to compare the maximum output of the fitness functions given some potential solution inside the range of expected values, then some potential solution outside the range of expected values, and finally the first solution again. The fitness function class will adjust the maximum output if some solution produces a higher output than the current maximum. This happens so that we do not get a fitness greater than the maximum fitness. The fitness should change between the first and the last test, and the fitness of the first test should be higher. This does not apply to Ackley's function because the values outside the range of expected values are not necessarily higher or lower than those inside the range. Again, we used statistical metamorphic testing in this relation. We generated 2 samples, each with 20 observations. The initial sample was generated by feeding the maximum input (e.g. (30, 30, 30) for 3 dimensions in the Rosenbrock function) into the function 20 times. Next, we ran the function with a single input that was larger than the maximum input (e.g. (80, 80, 80) in 3 dimensions for the Rosenbrock function). Finally, we generated the follow-up sample by running the function on the same input as the initial sample. Our null hypothesis was that the means of the two samples would be equal, while our alternative hypothesis was that the mean of the initial sample would be greater than the mean of the follow-up sample.
Relation 1.5
Since all of our fitness functions can generalize to more than three dimensions, another metamorphic relation for the fitness functions is to create a fitness function object for two dimensions and input a good solution in two dimensions, say (1,1) for the Rosenbrock function. The follow-up test would be to create a fitness function object for more than two dimensions and input the same good solution, simply with more dimensions, e.g. (1,1,1,1) for the Rosenbrock function. The scaled fitness for the initial test and the follow-up test will be the same.
4.2 Metamorphic relations for genetic algorithm operators
4.2.1 Metamorphic relations for mutation
Relation 2.1
Let us assume that the mutation rate is 1. Let us also assume that we have a solution with $D$ dimensions. We then run our mutation operator on our solution. Each value in the mutated solution will differ from the corresponding unmutated value by some amount. This amount will be greater than or equal to 0 and less than 0.1. The average amount will be 0.05. An example initial test case would be the randomly generated solution (3, 27, 6, 14, 30, 16, 1, 16, 10, 29) for 10 dimensions. We then run the mutation operator, with the mutation rate set to 0.1, on this solution and record the difference between the input and the output. The follow-up test case would use another randomly generated solution with the mutation rate set to 0.9, and the difference between the input and output of the mutation operator would be recorded. Our null hypothesis was that the mean difference between the input and output for the two test cases would be equal. Our alternative hypothesis was that the mean difference for the follow-up test case, when the mutation rate was set to 0.9, would be greater than the mean difference for the initial test case.
4.2.2 Metamorphic relations for crossover
Relation 2.2
The crossover operator takes as input a set of ‘parent’ solutions and outputs a ‘child’ solution that is a combination of elements of the parent solutions. For each dimension in the solution, the probability that a given parent will contribute its value for that dimension is $1/n$, where $n$ is the number of parents. Given a set of parents that are unique (i.e. no common elements between the parents), we can determine which element of the child solution came from which parent. This relation also used statistical metamorphic testing. With parents (1,2,3,4) and (5,6,7,8), our initial test case would set the crossover rate to 0.5 and run the crossover operator 20 times, generating 20 children. The follow-up test case would set the crossover rate to 1 and again generate 20 children. We would then calculate the proportion of elements in each child that came from the first parent. Our null hypothesis was that the average proportion for the initial sample would be equal to the average proportion for the follow-up sample. Our alternative hypothesis was that the average proportion for the initial sample would be greater than the average proportion for the follow-up sample. This metamorphic relation also applies to Differential Evolution.
4.2.3 Metamorphic relations for selection
Our implementation of the genetic algorithm uses fitness proportionate selection. This means that each solution’s fitness is scaled by the total fitness of the population. These scaled fitnesses all add up to 1. More specifically, this algorithm uses roulette wheel selection. This means that we create a scaled fitness vector, and we select a potential solution based on these scaled fitness vectors. Smaller fitness values are more likely to be selected, but in order to maintain diversity in our population of solutions, there must be some chance that solutions with higher fitness can be selected.
Relation 2.3
This metamorphic relation involves running the selection operator several times on two populations, one that contains several copies of the ideal solution, and one that does not but is identical in every other way. For an example with the Rosenbrock fitness function and 2 dimensions, our initial test case might have the individuals (2,3), (5,10), (27, 8), (17, 11), and (29, 2). The follow-up test case might have the individuals (3,4), (5,10), (17,11), (1,1), and (1,1). The average fitness for the initial test case would be worse than the average fitness for the follow-up test case. If the ideal solution is unknown, a good solution, as determined by the fitness function, could be used instead of the ideal solution.
4.3 System-level metamorphic relations
Relation 3.1
The number of generations we allow the genetic algorithm to run is a crucial parameter. If we increase the number of generations, the average fitness will improve, unless the fitness threshold is reached and the algorithm exits early. One way to prevent the early exit is to select a test problem that has many local optima, such as Ackley's function, so as to prevent early convergence. An example initial test case is setting the number of generations to 50. An example follow-up test case would be setting the number of generations to 5000. This relation applies to Differential Evolution.
Relation 3.2
Another crucial parameter is the population size. As the population size increases, the average fitness will also improve. This is because with more individuals in the population, we increase the chance that one of those individuals will encounter the true solution. For an initial test case, we used a population size of 5. The population size for the follow-up test case was 500. The average fitness for the follow-up test case will be better (lower) than the average fitness for the initial test case. Interestingly, for Differential Evolution, smaller population sizes improve average fitness more than larger population sizes (assuming some low number of iterations and/or a problem with many local minima). This could be because the decreased diversity of the population leads to a faster decrease in fitness. Alternatively, there could be an interaction between the population size and another important parameter that influences the ideal size of the population. We used the same initial and follow-up test cases for Differential Evolution. The average fitness for the follow-up test case will be worse (larger) than the average fitness for the initial test case.
Relation 3.3
Finally, if we increase the threshold parameter, the average fitness will be worse, but the average number of iterations run by the algorithm will decrease. This is because when the threshold is set at a higher value, the algorithm reaches the threshold relatively quickly and exits. When the threshold is set at a lower value, the algorithm continues searching until it reaches the lower value, so the fitness will be lower (better). Our initial test case used a threshold of 0.5 and our follow-up test case used a threshold of 0.05. The average fitness for the initial test case was greater (worse) than the average fitness for the follow-up test case. The average number of iterations run for the initial test case was less than the average number of iterations run for the follow-up test case.
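The two-sided nature of Relation 3.3 (worse fitness, fewer iterations) can be sketched as below. To keep the example short, a simple hill climber stands in for the genetic algorithm and a sphere function stands in for the fitness functions used in the experiments; `run_until_threshold`, the step size, and the seed are all hypothetical.

```python
import random


def sphere(x):
    """Stand-in fitness function (an assumption); minimum 0 at the origin."""
    return sum(v * v for v in x)


def run_until_threshold(threshold, max_iters=10000, dims=2, seed=1):
    """Hypothetical (1+1) hill climber that exits once fitness <= threshold.

    Returns (best_fitness, iterations_used).
    """
    rng = random.Random(seed)
    best = [rng.uniform(-2, 2) for _ in range(dims)]
    best_fit = sphere(best)
    iters = 0
    while best_fit > threshold and iters < max_iters:
        candidate = [v + rng.gauss(0, 0.1) for v in best]
        candidate_fit = sphere(candidate)
        if candidate_fit < best_fit:  # accept improvements only
            best, best_fit = candidate, candidate_fit
        iters += 1
    return best_fit, iters


fit_initial, iters_initial = run_until_threshold(0.5)     # initial test case
fit_followup, iters_followup = run_until_threshold(0.05)  # follow-up test case

# Higher threshold: no better fitness, and no more iterations.
assert fit_initial >= fit_followup
assert iters_initial <= iters_followup
```

Because both runs share a seed and only accept improvements, the follow-up run retraces the initial run and then continues, so both inequalities hold on every execution. In a statistical setting the same two comparisons would be made on averages over repeated runs.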
The other two parameters, mutation rate and replacement rate, are much more difficult to define relations for. In Differential Evolution, the differential weight (F) and the crossover rate can be seen as approximate substitutes for the mutation and replacement rates. As Figure 1 shows, when both of these parameters are at 0, we see no improvement in the total fitness of the population over the generations. As we increase the parameters, total fitness improves and the best fitness is reached in fewer iterations. At a certain point, total fitness begins to oscillate. If the oscillations are small, we should still see convergence towards our ideal solution. However, these oscillations, no matter how small, increase the number of generations needed to find the ideal solution. The point where total fitness begins to oscillate is not easily identifiable and is thought to depend on the problem. In this instance, unstable oscillations occur when the mutation rate is at 0.8 and the replacement rate is at 2. When the unstable oscillations occur, overall fitness increases rather than decreasing. It is unclear how the mutation rate and replacement rate interact in general to speed or slow convergence towards the best solution. However, we can still define some metamorphic relations for mutation rate and replacement rate.
Relation 3.4
When both parameters are at 0, the average fitness will be worse than if both parameters are at 0.5. This is because, with both parameters at 0, the final solution depends only on the randomly initialized potential solutions rather than on the changes made to those solutions. This also applies to the differential weight and crossover rate in Differential Evolution.
Relation 3.5
On the other hand, if both parameters are at 1, the average fitness does not follow the same behavior. We think this is because each time mutation happens, the solution changes by only a small amount, and recombination never adds new solutions. Thus each solution is only ever changed by a small amount. It is possible that the mutation operator could be changed so that setting both parameters equal to 1 would be roughly equivalent to a random search. This relation was tested in Section 5.4, but was not tested in the remainder of the experiments.
Relation 3.6
If we hold mutation rate constant at 0, and increase the replacement rate to 0.5, the average fitness of the best solution in the population will be better than when both mutation rate and replacement rate are at 0. Performing Relation 3.6 with mutation and replacement rate switched, however, does not produce the same results. This is because mutation happens only when a potential solution is selected, so the mutation rate depends on the selection rate.
Relation 3.7
If, however, we hold the recombination rate constant at some low number, say 0.1, and compare a mutation rate of 0 with a mutation rate of 0.5, the average fitness for the best solution will be better when the mutation rate is 0.5 than when it is 0.
Relation 3.8
For this reason, the common wisdom in the genetic algorithm literature is that the mutation rate should be set lower than the replacement rate (Deb and Agrawal, 1998). If this common wisdom holds, one metamorphic relation for mutation is to run the whole algorithm with parameters that follow this common wisdom, and then run the algorithm again with the parameter values swapped. The average fitness for the best solution should be better when the parameters follow the common wisdom than when they do not. However, when the values are 0.1 and 0.8, this is not the case. Much of the time, the swapped parameters (i.e. the mutation rate set higher than the replacement rate) performed better than the parameters that followed the common wisdom. This high rate of failure is one reason this test was not used in the analyses, other than in Section 5.4.
Relation 3.9
On the other hand, we can test this type of interaction between the replacement and mutation rates by using the values 0 and any other number strictly greater than 0 and less than or equal to 1. In that case, the algorithm should behave like Relation 3.6 with the parameters switched. In other words, the average best fitness for this algorithm will be no better than when both parameters are set at 0.
4.4 Deterministic unit tests
We implemented several unit tests for both the genetic algorithm and differential evolution that did not use metamorphic relations. These tests included testing initialization, selecting the best chromosome from the population, checking that fitness changes when the update fitness function is called, and checking constant and known values. These were used for comparisons against the metamorphic relations. The scope of these deterministic tests is limited, but testing many parts of a genetic algorithm is difficult, if not impossible, without metamorphic testing.
5 Experiments
To conduct mutation testing, we used the PIT mutation testing tool (Coles, 2012). Although PIT does not include the source code of the mutants, it does report the line number where each mutation happened and the type of mutation (e.g. negated conditional, replaced addition with subtraction). For each class, PIT reports each mutant that survived, was killed, timed out, or was not covered. For each test, 182 mutants were generated. PIT uses line coverage to identify mutants not covered by test cases. PIT generates the following types of mutants:

Replaced operator () with another operator.

Changed conditional boundary.

Negated conditional.

Changed increment.

Mutated return value.

Removed call to other function.
Unless otherwise specified by the metamorphic relation, the parameters for the genetic algorithm were as follows:

.

.

.

.

.
| Statements Covered | Total Tests | Deterministic Tests | Total MRs | Function-Level MRs | System-Level MRs | Total Lines |
|---|---|---|---|---|---|---|
| Our Implementation | 243 | 87 | 169 | 136 | 165 | 194 |
| FEA Framework | 125 | 119 | 125 | 124 | 125 | 245 |

5.1 Overall results
Relation 2.3 detected a fault in our original implementation of the genetic algorithm. The fault in the selection function occurred because we were prioritizing higher fitness, rather than lower fitness. Once we had fixed this error, we generated mutation scores for all types of tests. As shown in Table 3, the overall mutation score is 84% for all the tests written. Several classes received a mutation score of 100%. The vast majority of mutants that were not killed survived because there was no coverage, that is, no tests, for the code that was changed (44 of the 59 mutants that were not killed). This is a problem that can be easily remedied with more tests.
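The kind of fault described above, a selection routine that prefers higher rather than lower fitness, resembles the comparison changes a mutation tool introduces. The sketch below shows how a relation that adds copies of the ideal solution to a population can expose it. The one-dimensional fitness function, both selection routines, and the population values are hypothetical and are not taken from the implementation under test.

```python
def fitness(x):
    """Hypothetical 1-D minimisation problem with its ideal solution at x = 0."""
    return x * x


def select_best(population):
    """Intended behaviour: return the individual with the LOWEST fitness."""
    best = population[0]
    for candidate in population[1:]:
        if fitness(candidate) < fitness(best):
            best = candidate
    return best


def select_best_faulty(population):
    """Faulty variant: the comparison is reversed, so HIGHER fitness wins."""
    best = population[0]
    for candidate in population[1:]:
        if fitness(candidate) > fitness(best):
            best = candidate
    return best


def relation_holds(select):
    """The follow-up population adds copies of the ideal solution, so the
    selected individual must be strictly fitter than in the initial case."""
    initial = [3.0, -4.0, 7.5, 1.5]
    followup = initial + [0.0, 0.0]
    return fitness(select(followup)) < fitness(select(initial))


print(relation_holds(select_best))         # True: relation satisfied
print(relation_holds(select_best_faulty))  # False: fault detected
```

No expected output value is needed for either population; the test only compares the two selections with each other, which is what lets a metamorphic relation work without an oracle.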
| Class | All Tests | Deterministic Tests | All Relations |
|---|---|---|---|
| Genetic Algorithm | 90/117 (77%) | 20/117 (17%) | 90/117 (77%) |
| Chromosome | 14/15 (93%) | 8/15 (53%) | 11/15 (73%) |
| Fitness Function | 9/11 (81%) | 5/11 (46%) | 7/11 (64%) |
| Ackleys Function | 13/14 (93%) | 11/14 (79%) | 12/14 (86%) |
| Quartic Function | 8/8 (100%) | 0/8 (0%) | 8/8 (100%) |
| Rosenbrock Function | 18/18 (100%) | 16/18 (89%) | 18/18 (100%) |
| Combined Gen. Algo. | 104/132 (79%) | 28/132 (21%) | 101/132 (77%) |
| Combined Fit. Func. | 48/51 (94%) | 32/51 (63%) | 45/51 (88%) |
| Total | 152/182 (84%) | 60/182 (33%) | 147/182 (81%) |

5.2 Deterministic and Metamorphic Comparison
We next divided the tests into deterministic and metamorphic tests. All tests were run again and the mutation score was calculated for each type of test. For the deterministic tests, the mutation score was 33%, as seen in Table 3. This is quite low, due at least in part to not being able to test the random elements of the genetic algorithm. The mutation score for metamorphic testing was 81%. Part of the reason for the large disparity may be that we identified two different types of metamorphic tests: function-level and system-level metamorphic relations.
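For reference, the mutation score is the number of killed mutants divided by the number of mutants generated. The short sketch below recomputes the totals from the killed and generated counts reported in Table 3; the dictionary layout and names are illustrative only.

```python
# (killed, generated) totals as reported in Table 3.
results = {
    "All tests": (152, 182),
    "Deterministic tests": (60, 182),
    "All relations": (147, 182),
}

# Mutation score = killed mutants / generated mutants.
scores = {name: killed / total for name, (killed, total) in results.items()}

for name, score in scores.items():
    print(f"{name}: {score:.1%}")
```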
| Class | All Relations | Function-Level | System-Level |
|---|---|---|---|
| Genetic Algorithm | 90/117 (77%) | 48/117 (41%) | 74/117 (64%) |
| Chromosome | 11/15 (73%) | 11/15 (73%) | 11/15 (73%) |
| Fitness Function | 7/11 (64%) | 7/11 (64%) | 6/11 (55%) |
| Ackleys Function | 12/14 (86%) | 12/14 (86%) | 8/14 (57%) |
| Quartic Function | 8/8 (100%) | 8/8 (100%) | 8/8 (100%) |
| Rosenbrock Function | 18/18 (100%) | 18/18 (100%) | 7/18 (39%) |
| Combined Gen. Algo. | 101/132 (77%) | 59/132 (45%) | 85/132 (64%) |
| Combined Fit. Func. | 45/51 (88%) | 45/51 (88%) | 29/51 (57%) |
| Total | 147/182 (81%) | 104/182 (57%) | 114/182 (63%) |

5.3 System-level and Function-level comparison
Since the deterministic tests were implemented only at the function level and there were no system-level deterministic tests, we divided the metamorphic relations into system-level and function-level tests. The mutation score for the function-level metamorphic relations was 0.577, from Table 4. This was higher than the deterministic tests. We were originally concerned that this was due to differences in coverage between the different types of tests. However, if we approximate a normalization by dividing the number of mutants killed by the lines of code covered, we see that the deterministic tests only scored 0.588, while the function-level metamorphic relations obtained a score of 0.772 (the best possible score would be 1.055). This means that the deterministic tests had a lower mutation score, even when measured relative to statement coverage. The mutation score for the system-level relations was even higher than for the function-level relations, at 0.648. Most of what is driving that number is the higher coverage for the Genetic Algorithm class. As Table 4 shows, when separated out by class, the system-level (whole-algorithm) relations had a mutation score of 0.638 for the Genetic Algorithm class, while the function-level relations only had a mutation score of 0.414. On every other class, the function-level relations had mutation scores at least as high as the system-level relations.
5.4 System-level Tests with Different Fitness Functions
Finally, we examined the failure rates of the system-level (whole-algorithm) tests. We expected that some small number of tests would fail if the tests were run a sufficient number of times, given the nature of statistical tests. However, we suspected we were seeing more failures than chance alone would explain. If this was due to an actual error in the genetic algorithm, we expected that the test failures would be consistent. The failures we were seeing occurred inconsistently. One potential cause was that we were using Ackley’s function to set up the tests for several metamorphic relations. Ackley’s function has many local optima. We hypothesized that if there were more failures than expected, these failures were due to the algorithm getting stuck in local optima of Ackley’s function and not converging towards the global optimum.
We restricted the set of tests run to the system-level metamorphic relations, specifically to Relations 3.1, 3.2, 3.3, 3.4, 3.5, and 3.8. We ran each set of relations ten times. We then changed the fitness functions for each relation and ran the relations another ten times. For example, we would run Relation 3.1 ten times with Ackley’s function, then ten times with the Quartic function, and finally ten times with the Rosenbrock function. If the statistical tests alone were causing the failures, we expected that failure would occur less than one time for each fitness function with each relation.
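| Fitness function | 3.1 | 3.2 | 3.3 | 3.4 | 3.5 | 3.8 |
|---|---|---|---|---|---|---|
| Ackleys | 7 | 0 | 0 | 0 | 2 | 8 |
| Quartic | 6 | 0 | 0 | 0 | 3 | 5 |
| Rosenbrock | 0 | 0 | 0 | 0 | 4 | 5 |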
As Table 5 shows, only 3 relations failed at all, and all the failing relations failed much more often than we would expect given a 95% confidence interval. Relations 3.2, 3.3 and 3.4 succeeded consistently. We restricted our further tests to Relations 3.1, 3.5 and 3.8. Relation 3.1 failed only when run with the Ackley’s and Quartic functions. This fit our hypothesis that failing tests were getting stuck in local optima. We ran this relation 20 more times with just the Rosenbrock function and found that it did not fail in any of those runs. However, running Relations 3.5 and 3.8 with the Rosenbrock function did not alter the number of failures seen. One option for Relation 3.5 was to increase the average alteration that the mutation operator makes to each value, in the hope that this would reduce the failure rate. However, this would require altering Relation 2.1 to match. Instead we developed Relations 3.6 and 3.7 to substitute for this relation. Relation 3.8 was the most problematic relation, and fixing it would probably also require a change to the mutation operator. We identified Relation 3.9 to replace Relation 3.8. After we replaced the faulty relations, we ran Relations 3.1, 3.2, 3.3, 3.4, 3.6, 3.7, and 3.9 another twenty times each, and saw no more failures.
5.5 FEA framework
We used the same tests and relations for the FEA framework as we did for our implementation. Strasser et al. (2016) had previously constructed 70 unit tests for the FEA framework. All but 2 of these unit tests were irrelevant to the genetic algorithm, fitness functions and differential evolution. As Table 6 shows, the system-level (whole-algorithm) relations achieved the highest mutation score in the genetic algorithm classes, while the function-level relations achieved higher mutation scores for the fitness function classes. Both types of relation had higher mutation scores than the deterministic unit tests. Since more deterministic tests were implemented, but the mutation scores for the relations were higher, we can surmise that the metamorphic relations were far more effective than the deterministic unit tests. When we compared the results of our implementation to the FEA framework, we found lower mutation scores for the FEA framework, as seen in Table 8. This suggests that we may have been targeting our relations to our implementation. However, the mutation scores for the metamorphic relations were still higher than the mutation scores for the deterministic tests, suggesting that these relations are still more effective at detecting errors than deterministic tests.
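The expectation of fewer than one failure per ten runs can be made concrete with a binomial model. The sketch below assumes that, when no fault is present, each run of a relation fails independently with probability 0.05 (one reading of the 95% confidence level); both the independence and the per-run probability are assumptions, and `prob_at_least` is a hypothetical helper.

```python
from math import comb


def prob_at_least(k, runs=10, p_fail=0.05):
    """P(X >= k) for X ~ Binomial(runs, p_fail)."""
    return sum(
        comb(runs, i) * p_fail**i * (1 - p_fail) ** (runs - i)
        for i in range(k, runs + 1)
    )


# Expected number of chance failures in ten runs.
expected_failures = 10 * 0.05
print(expected_failures)  # 0.5

# Probability of seeing at least k chance failures in ten runs.
for k in (1, 2, 5, 7, 8):
    print(k, prob_at_least(k))
```

Under these assumptions, two or more failures in ten runs would occur by chance in fewer than 9% of cases, and seven or more in fewer than one case in a million, which is consistent with treating the counts in Table 5 as a problem with the relations rather than as statistical noise.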
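| Class | Initial | All Tests | Deterministic Tests | All Relations |
|---|---|---|---|---|
| Genetic Algorithm | 0/89 (0%) | 62/89 (70%) | 35/89 (39%) | 62/89 (70%) |
| Fitness Functions | 24/71 (33%) | 47/71 (66%) | 35/71 (49%) | 43/71 (60%) |
| Total | 24/160 (15%) | 109/160 (68%) | 70/160 (44%) | 105/160 (66%) |

| Class | All Relations | Function-Level | System-Level |
|---|---|---|---|
| Genetic Algorithm | 62/89 (70%) | 37/89 (42%) | 62/89 (70%) |
| Fitness Functions | 43/71 (60%) | 39/71 (54%) | 36/71 (50%) |
| Total | 105/160 (66%) | 76/160 (48%) | 98/160 (61%) |

| Implementation | Initial | All | Deterministic | MRs | Function MRs | System MRs |
|---|---|---|---|---|---|---|
| FEA framework | 15% | 68% | 44% | 66% | 48% | 61% |
| Our impl. | 0% | 83% | 33% | 81% | 58% | 64% |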
5.6 Differential Evolution
We ran five tests from the genetic algorithm on the differential evolution algorithm in each implementation. Two of these tests were deterministic, while the remaining three were metamorphic relations. The tests performed much better than expected on the differential evolution algorithm, given that all but one of the tests were virtually identical to the genetic algorithm tests. The mutation score for all tests on the FEA framework implementation of differential evolution was 95%, while for our implementation the mutation score for all tests was 86%. This was the opposite of the genetic algorithm results, in that the tests on the FEA framework performed better than the tests on our implementation. Additionally, in the FEA framework, the differential evolution tests outperformed the genetic algorithm tests on our implementation in terms of mutation score, while the genetic algorithm tests on our implementation slightly outperformed the differential evolution tests on our implementation. This increase in performance is intriguing, although it can probably be attributed to differential evolution being a simpler program with fewer random elements.
| Implementation | All | Deterministic | MRs |
|---|---|---|---|
| FEA framework | 37/39 (95%) | 19/39 (49%) | 31/39 (79%) |
| Our impl. | 50/58 (86%) | 7/58 (12%) | 42/58 (72%) |
6 Conclusions
6.1 Contributions
In this work we identified metamorphic relations of genetic algorithms and genetic algorithm operators. Additionally, we identified metamorphic relations of genetic algorithms that translate to differential evolution, and that may be more illustrative in differential evolution than in genetic algorithms. We defined 17 metamorphic relations: five for the fitness functions, three for the genetic algorithm operators, and nine for the whole algorithm.
We compared the metamorphic relations to the deterministic unit tests for the genetic algorithm. We found that metamorphic relations for our implementation had a mutation score of 81%, while traditional deterministic unit tests had a mutation score of 33%. We also compared the function-level metamorphic relations to the system-level metamorphic relations on the genetic algorithm for our implementation. Function-level metamorphic relations had a mutation score of 57%, while system-level metamorphic relations had a mutation score of 63%. For the FEA framework, the mutation score for the metamorphic relations was 66%, while the mutation score for the deterministic tests was 44%. When comparing system-level and function-level metamorphic relations for the FEA framework, the mutation score for system-level metamorphic relations was 61% and the mutation score for function-level metamorphic relations was 48%. Additionally, we modified two relations that failed more often than was statistically likely on both implementations of the genetic algorithm when no fault was present. Finally, we used five tests, two deterministic unit tests and three metamorphic relations, on the differential evolution algorithm. The mutation score for the metamorphic relations was 79% for the FEA framework and 72% for our implementation. The mutation score for the deterministic tests was 49% for the FEA framework and 12% for our implementation.
These comparisons demonstrated the effectiveness of the metamorphic testing approach when testing genetic algorithms. We assessed the mutation score relative to the statement coverage, and found that function-level metamorphic relations performed better than function-level deterministic tests, despite there being more deterministic tests. This result was consistent across both genetic algorithm implementations we tested. We examined the failure rates of the system-level relations when initialized with various fitness functions. We found one relation that only performed well when paired with a particular fitness function. Additionally, we found two relations that do not perform well no matter the fitness function. We developed two new relations to replace these problematic ones.
6.2 Future Work
Future work for metamorphic testing on genetic algorithms would include identifying individual metamorphic relations that are either more generalizable, or kill more mutants than other metamorphic relations. We also plan to test these relations on different types of operators and different purposes for a genetic algorithm. We plan to develop relations for other operators, purposes, or algorithms. We can then also test those relations on this genetic algorithm implementation. We would like to identify types of mutants that survive more often, in order to identify metamorphic relations that are able to target these mutants, or show that these mutants are more likely to be equivalent. Finally, any future work would have to include the identification of more metamorphic relations, and the implementation of more tests.
References
 Lee et al. (2016) C. Lee, K. Choy, G. Ho, C. Lam, A slippery genetic algorithm-based process mining system for achieving better quality assurance in the garment industry, Expert Systems with Applications 46 (2016) 236–248.
 Chiappone et al. (2016) S. Chiappone, O. Giuffrè, A. Granà, R. Mauro, A. Sferlazza, Traffic simulation models calibration using speed–density relationship: An automated procedure based on genetic algorithm, Expert Systems with Applications 44 (2016) 147–155.
 Pareek and Patidar (2016) N. K. Pareek, V. Patidar, Medical image protection using genetic algorithm operations, Soft Computing 20 (2016) 763–772.
 Davis et al. (2016) J. B. A. Davis, S. L. Horswell, R. L. Johnston, The application of a parallel genetic algorithm to the global optimisation of gas-phase and supported gold-iridium nanoalloys, The Journal of Physical Chemistry C (2016).

 Perreault et al. (2015) L. J. Perreault, M. Thornton, R. Goodman, J. W. Sheppard, A swarm-based approach to learning phase-type distributions for continuous time bayesian networks, in: Computational Intelligence, 2015 IEEE Symposium Series on, IEEE, 2015, pp. 1860–1867.
 Jones et al. (1996) B. Jones, H.-H. Sthamer, D. Eyres, Automatic structural testing using genetic algorithms, Software Engineering Journal 11 (1996) 299–306.
 Wegener et al. (1997) J. Wegener, H. Sthamer, B. F. Jones, D. E. Eyres, Testing real-time systems using genetic algorithms, Software Quality Journal 6 (1997) 127–135.
 Rao and Govindarajulu (2015) C. P. Rao, P. Govindarajulu, Genetic algorithm for automatic generation of representative test suite for mutation testing, International Journal of Computer Science and Network Security (IJCSNS) 15 (2015) 11.
 Tan et al. (2003) K. C. Tan, Q. Yu, C. Heng, T. H. Lee, Evolutionary computing for knowledge discovery in medical diagnosis, Artificial Intelligence in Medicine 27 (2003) 129–154.
 PenaReyes and Sipper (1999) C. A. PenaReyes, M. Sipper, A fuzzygenetic approach to breast cancer diagnosis, Artificial intelligence in medicine 17 (1999) 131–155.
 Xie et al. (2009) X. Xie, J. Ho, C. Murphy, G. Kaiser, B. Xu, T. Y. Chen, Application of metamorphic testing to supervised classifiers, in: Quality Software, 2009. QSIC’09. 9th International Conference on, IEEE, 2009, pp. 135–144.
 Guderlei and Mayer (2007) R. Guderlei, J. Mayer, Statistical metamorphic testing: testing programs with random output by means of statistical hypothesis tests and metamorphic testing, in: Quality Software, 2007. QSIC’07. Seventh International Conference on, IEEE, 2007, pp. 404–409.
 Stewart et al. (2004) T. J. Stewart, R. Janssen, M. van Herwijnen, A genetic algorithm approach to multiobjective land use planning, Computers & Operations Research 31 (2004) 2293–2313.
 Chen and Yiu (1998) S. C. C. Chen, Tsong Y., S. M. Yiu, Metamorphic testing: a new approach for generating next test cases (1998).
 Geronimo et al. (2012) L. D. Geronimo, F. Ferrucci, A. Murolo, F. Sarro, A parallel genetic algorithm based on hadoop mapreduce for the automatic generation of junit test suites, in: Software Testing, Verification and Validation (ICST), 2012 IEEE Fifth International Conference on, IEEE, 2012, pp. 785–793.
 Henard et al. (2013) C. Henard, M. Papadakis, G. Perrouin, J. Klein, Y. L. Traon, Multiobjective test generation for software product lines, in: Proceedings of the 17th International Software Product Line Conference, ACM, 2013, pp. 62–71.
 Sharma et al. (????) C. Sharma, S. Sabharwal, R. Sibal, Applying genetic algorithm for prioritization of test case scenarios derived from uml diagrams, arXiv preprint arXiv:1410.4838 (????).
 Arcuri and Briand (2014) A. Arcuri, L. Briand, A hitchhiker’s guide to statistical tests for assessing randomized algorithms in software engineering, Software Testing, Verification and Reliability 24 (2014) 219–250.
 Mitchell et al. (1997) T. M. Mitchell, et al., Machine learning. wcb, 1997.
 Xie et al. (2011) X. Xie, J. W. Ho, C. Murphy, G. Kaiser, B. Xu, T. Y. Chen, Testing and validating machine learning classifiers by metamorphic testing, Journal of Systems and Software 84 (2011) 544–558.
 Ishibuchi et al. (1997) H. Ishibuchi, T. Murata, I. Türksen, Singleobjective and twoobjective genetic algorithms for selecting linguistic rules for pattern classification problems, Fuzzy Sets and Systems 89 (1997) 135–150.
 Corcoran and Sen (1994) A. L. Corcoran, S. Sen, Using realvalued genetic algorithms to evolve rule sets for classification, in: Evolutionary Computation, 1994. IEEE World Congress on Computational Intelligence., Proceedings of the First IEEE Conference on, IEEE, 1994, pp. 120–124.
 Hall et al. (2009) M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, The weka data mining software: an update, ACM SIGKDD explorations newsletter 11 (2009) 10–18.
 Yoo (2010) S. Yoo, Metamorphic testing of stochastic optimisation, in: Software Testing, Verification, and Validation Workshops (ICSTW), 2010 Third International Conference on, IEEE, 2010, pp. 192–201.
 Murphy et al. (2009) C. Murphy, K. Shen, G. Kaiser, Automatic system testing of programs without test oracles, in: Proceedings of the eighteenth international symposium on Software testing and analysis, ACM, 2009, pp. 189–200.
 Andrews et al. (2005) J. H. Andrews, L. C. Briand, Y. Labiche, Is mutation an appropriate tool for testing experiments?[software testing], in: Software Engineering, 2005. ICSE 2005. Proceedings. 27th International Conference on, IEEE, 2005, pp. 402–411.
 Hancock (1994) P. J. Hancock, An empirical comparison of selection methods in evolutionary algorithms, in: AISB Workshop on Evolutionary Computing, Springer, 1994, pp. 80–94.
 Goldberg and Deb (1991) D. E. Goldberg, K. Deb, A comparative analysis of selection schemes used in genetic algorithms, Foundations of genetic algorithms 1 (1991) 69–93.
 Strasser et al. (2016) S. Strasser, J. Sheppard, N. Fortier, R. Goodman, Factored evolutionary algorithms, in: IEEE Transactions on Evolutionary Computation, 2016, p. Under review.
 Storn and Price (1995) R. Storn, K. Price, Differential evolution - a simple and efficient adaptive scheme for global optimization over continuous spaces, volume 3, ICSI Berkeley, 1995.
 Jamil and Yang (2013) M. Jamil, X.S. Yang, A literature survey of benchmark functions for global optimisation problems, International Journal of Mathematical Modelling and Numerical Optimisation 4 (2013) 150–194.
 Deb and Agrawal (1998) K. Deb, S. Agrawal, Understanding interactions among genetic algorithm parameters., in: FOGA, 1998, pp. 265–286.
 Coles (2012) H. Coles, Pit mutation testing, 2012.