Genetic and Evolutionary Algorithms (GEAs) solve optimization and search problems by codifying a population of possible solutions, evaluating them for fitness, and iteratively modifying them in an attempt to improve their fitness. The digital manifestations of the solutions are called genotypes, and their interpretations in the specific problem domain are called phenotypes. The function that maps genotypes to phenotypes is simply called the representation, and it can have a significant impact on how successfully and how quickly the GEA approximates optimal solutions. Consequently, many empirical and theoretical studies have investigated the effects of representation on GEA performance under different variation operators (Sec. 2).
Variation of genotypes is a key component of evolution, both biological [mitchell07:evolutionary] and computational [goldberg92:genetic]. Perhaps the most common operator for variation in GEAs is mutation, which may be combined with another operator, recombination [back2018:evolutionary]. There are various implementations of the mutation operator, but typically they embody a localized change to an individual genotype. Contrast this with recombination, which requires two or more individuals and often involves nonlocal changes to the genotypes.
One popular example is point mutation, a simple mutation operator which randomly changes one allele at a time. Even within point mutation there are variations and parameters that control the magnitude of the genotypic change. Thus, in a search GEA, mutation can be used both for exploration—sampling many disparate parts of the search space—and for exploitation—thoroughly searching in a localized subspace [crepinsek13:exploration].
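As a minimal sketch of the operators just described, assuming genotypes are stored as Python lists of 0/1 bits: `single_bit_flip` changes exactly one allele, while `per_bit_mutation` is one common variant whose `rate` parameter controls the expected magnitude of the genotypic change. The function names are our own and purely illustrative.

```python
import random

def single_bit_flip(genotype, rng):
    """Point mutation: flip exactly one uniformly chosen bit."""
    i = rng.randrange(len(genotype))
    child = list(genotype)
    child[i] ^= 1
    return child

def per_bit_mutation(genotype, rate, rng):
    """Variant: flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in genotype]

def hamming(a, b):
    """Number of positions at which two equal-length genotypes differ."""
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(0)          # fixed seed, for reproducibility only
parent = [0, 1, 1, 0, 1, 0, 0, 1]
child = single_bit_flip(parent, rng)
print(hamming(parent, child))   # always 1 for a single-bit flip
```

A low `rate` keeps offspring close to the parent (exploitation), while a high `rate` produces distant offspring (exploration).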
An important property of mutation that can increase the predictability and interpretability of the GEA is to have “good locality,” which we informally define here as the property that small variations in the genotype lead to small variations in the phenotype [pereira06:analysis]. Good locality implies better control of the GEA, because tuning mutation for a certain magnitude of changes in the genotype—the inputs to the search—leads to an expectation of the magnitude of changes in the phenotype—the outcome of the search. Note that the mutation operator is intricately tied to the representation. It is the combination of the mutation operator and representation that determines the magnitude of phenotypic change. Often, the discussion of locality assumes a fixed mutation operator, such as uniform bit flips or Gaussian differences, and focuses on the representations, such as standard binary or Gray.
In his seminal book, Rothlauf presented a theoretical framework for comparing representations based on three aspects: redundancy, scaling, and locality [rothlauf06:representations]. Rothlauf proposed two metrics to quantify the locality of representations, one specifically for point mutation (called simply “locality” in the book) and one for any variation operator (called “distance distortion”).
This paper focuses on these metrics when applied to a widely used subclass of representations, namely nonredundant translations from bitstring genotypes to nonnegative integer phenotypes (for example, Gray encoding [whitley99:free]). After surveying previous studies in the next section, we derive new lower and upper bounds on point locality, as well as asymptotic limits for general locality, in Sec. 3. We then attempt to replicate and explain past experimental results in the context of these derived values in Sec. 4. Most of these experiments are not novel. In fact, we tried to reproduce these past studies as faithfully as we could so that we could frame the old performance results with our new understanding of locality. Surprisingly, the locality metrics in this domain offer little predictive value on GEA performance, and we discuss possible reasons for this finding in Sec. 5.
2 Related Work
Because of the strong impact of representations on GEA performance, there have been several attempts to formalize and measure the effects of representation locality [galvan11:defining, gottlieb01:prufer, gottlieb99:characterizing, jones95:fitness, manderick91:genetic, ronald97:robust]. These measures have been applied to a variety of phenotype classes, such as floating-point numbers [pereira06:analysis], computer programs [galvan11:defining, rothlauf06:grammar], permutations [galvan09:effects], and trees [hoai06:representation, rothlauf99:tree]. Many of these approaches focus on measuring the effect of genotypic changes on fitness distances [gottlieb99:characterizing]. A more general approach is to limit the scope to the effect on phenotypic distances, because it provides a locality measure of a representation that is independent of fitness function.
The foundational locality definitions for our study and several others [chiam06:issues, galvan10:towards] come from Rothlauf’s treatise on the theory of GEA representations [rothlauf06:representations]. Rothlauf defines the locality $d_m$ as:
\[ d_m = \sum_{d^g_{x,y} = d^g_{\min}} \left| d^p_{x,y} - d^p_{\min} \right| \]
where for every two distinct genotypes $x, y$, $d^g_{x,y}$ is the genotypic distance between $x$ and $y$, and $d^p_{x,y}$ is the phenotypic distance between their respective phenotypes, based on our metrics of choice for genotypic and phenotypic spaces. Similarly, $d^g_{\min}$ and $d^p_{\min}$ represent the minimum possible distance between genotypes or phenotypes, respectively. (Footnote 1: This definition has changed from the book’s first edition to better match reader intuition and for simpler computation.) For example, for nonredundant representations of integers as bitstrings (the focus of this paper), genotypic distances are measured by Hamming distance and phenotypic distances use the usual Euclidean metric in $\mathbb{N}$.
This definition “describes how well neighboring genotypes correspond to neighboring phenotypes” (ibid. p. 77), which is valuable for measuring small genotypic changes, typically resulting from a single mutation. Extending this notion to include large genotypic changes, e.g., from a recombination operator, Rothlauf defines distance distortion $d_c$ as:
\[ d_c = \frac{2}{n_p (n_p - 1)} \sum_{i=1}^{n_p} \sum_{j=i+1}^{n_p} \left| d^p_{x_i,x_j} - d^g_{x_i,x_j} \right| \]
where $n_p$ is the size of the search space, and $d^p_{x_i,x_j}$ and $d^g_{x_i,x_j}$ are the phenotypic and genotypic distance, respectively, between two distinct individuals $x_i$, $x_j$. The term $\frac{2}{n_p(n_p-1)}$ is equal to $1/\binom{n_p}{2}$, the proportion of each distinct pair of individuals. This definition “describes how well the phenotypic distance structure is preserved when mapping $\Phi_p$ on $\Phi_g$”, where $\Phi_p$ is the phenotypic search space and $\Phi_g$ is the genotypic search space (ibid. p. 84).
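The distance distortion can be computed directly for small search spaces. The sketch below is illustrative and makes two assumptions of ours: a representation is stored as a tuple `rep` in which `rep[s]` is the phenotype of the bitstring whose integer value is `s`, and distances are Hamming (genotypes) and absolute difference (phenotypes). The function name `distance_distortion` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def distance_distortion(rep):
    """Mean |d^p - d^g| over all unordered pairs of individuals."""
    pairs = list(combinations(range(len(rep)), 2))
    total = sum(
        abs(abs(rep[s] - rep[t]) - bin(s ^ t).count("1"))  # |d^p - d^g|
        for s, t in pairs
    )
    return Fraction(total, len(pairs))

sb_2bit = (0, 1, 2, 3)    # standard binary: 00->0, 01->1, 10->2, 11->3
gray_2bit = (0, 1, 3, 2)  # binary reflected Gray: 00->0, 01->1, 10->3, 11->2

print(distance_distortion(sb_2bit), distance_distortion(gray_2bit))  # 2/3 1/3
```

Dividing by the number of unordered pairs is the same as multiplying by $\frac{2}{n_p(n_p-1)}$.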
Similar in spirit, Gottlieb and Raidl also defined a pair of locality metrics for mutation and crossover operators called mutation innovation and crossover innovation [gottlieb99:characterizing]. They additionally defined crossover loss, which measures the number of phenotypic properties that are lost by crossover. These metrics are probabilistic and empirical in nature, so they are harder to reason about analytically. But they have been demonstrated in practice to predict relative GEA performance on the multidimensional knapsack problem [pereira06:analysis, raidl05:empirical].
In a different study, Chiam et al. defined the concept of preservation, which “measure the similarities between the genotype and phenotype search space” [chiam06:issues]. Their study uses Hamming distance between genotypes and L2 norms between phenotypes to define metrics analogous to Rothlauf’s (called proximity preservation and remoteness preservation), but looking in both directions of the genotype-phenotype mapping. The authors demonstrated (as we prove in the next section) that standard binary and Gray encodings have the same genotype-to-phenotype locality, but not phenotype-to-genotype locality. They also predicted, based on this similarity, that crossover-based GEAs would perform about the same with both binary and Gray encodings, which contradicts some past results, as we discuss in Sec. 4.3.
A different approach to approximating locality was given by Mathias and Whitley [mathias94:transforming], and independently, by Weicker [weicker10:binary], using the various norms of the phenotype-genotype distance matrix for different representations. The studies did not draw performance predictions from these metrics.
In this study we build on Rothlauf’s definitions because they are both independent of the fitness function and explicitly computable in our domain. It is worth repeating that our definitions of locality are simply reformulations of Rothlauf’s definitions for the domain of nonredundant binary-integer representations, which we will refer to as “representations” for brevity. In the next section we proceed to analyze the theoretical limits of these metrics, while the subsequent section discusses their applicability in practice. As a case study, we compare the theoretical locality of standard binary (SB) and Gray representations, and then evaluate them experimentally, especially in the context of past studies that found Gray to outperform SB for several test functions.
3 Theoretical Results
In this section we show that point locality for standard binary representation is no worse than Gray’s, and that the distance distortion of all representations is asymptotically equivalent.
For our scope, we define a representation as a bijection $r: \{0,1\}^\ell \to [0, 2^\ell - 1]$ between the set of $\ell$-bit bitstrings and the discrete integer interval $[0, 2^\ell - 1]$. This ensures that the representation is not redundant, i.e., every integer in the interval is represented by exactly one $\ell$-bit bitstring, and the number of search-space points is exactly $2^\ell$. A representation can therefore be equivalently described as a permutation $\pi$ of $[0, 2^\ell - 1]$, where $\pi(i) = j$ if and only if the SB representation of $i$ maps to $j$ under $r$. Consequently, we can write $r$ as a $2^\ell$-tuple where the $i$th coordinate (starting at 0) is $\pi(i)$. We also use the notation $\mathrm{flip}_i(s)$ to denote the binary string produced from flipping the $i$th coordinate of the binary string $s$. Formally, $\mathrm{flip}_i(s) = s \oplus 2^i$, where $\oplus$ denotes exclusive-or.
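The tuple view of a representation translates directly into code. In this sketch (names such as `brg_tuple` and `is_representation` are ours, and bit 0 is assumed to be the least significant bit), entry `s` of the tuple holds the integer that the bitstring with SB value `s` encodes.

```python
def flip(s, i):
    """Flip bit i (0 = least significant) of the bitstring with integer value s."""
    return s ^ (1 << i)

def is_representation(rep, n_bits):
    """True if the tuple is a permutation of 0 .. 2**n_bits - 1 (nonredundant)."""
    return sorted(rep) == list(range(1 << n_bits))

def brg_tuple(n_bits):
    """Binary reflected Gray as a tuple: entry s is the integer encoded by bitstring s."""
    def decode(s):
        x = 0
        while s:          # XOR of all right shifts inverts the Gray transform
            x ^= s
            s >>= 1
        return x
    return tuple(decode(s) for s in range(1 << n_bits))

sb = tuple(range(8))   # standard binary is the identity permutation
brg = brg_tuple(3)
print(brg)                            # (0, 1, 3, 2, 7, 6, 4, 5)
print(format(flip(0b101, 1), "03b"))  # 111
```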
We can now develop an equivalent definition to Rothlauf’s locality that is specific to our domain, using units that we find more intuitive. We define the point locality for a nonredundant bitstring-to-integer representation as the mean change in phenotypic value for a uniformly random single-bit flip in the genotype. More formally:
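The point locality $p_r$ for a representation $r$ is the expected value of $|r(s) - r(\mathrm{flip}_i(s))|$ over a uniformly random bitstring $s \in \{0,1\}^\ell$ and bit position $i \in [0, \ell - 1]$, where $\mathrm{flip}_i(s)$ is $s$ with its $i$th bit flipped. Explicitly,
\[ p_r = \frac{1}{\ell\, 2^\ell} \sum_{s \in \{0,1\}^\ell} \sum_{i=0}^{\ell-1} \left| r(s) - r(\mathrm{flip}_i(s)) \right| \]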
Note that for a fixed value of $\ell$, our definition of $p_r$ is a simple linear transformation of Rothlauf’s $d_m$. In our domain, $d_m$ simply sums the phenotypic distances minus one between all distinct genotypic neighbors, while $p_r$ computes the average phenotypic distance between all ordered pairs of genotypic neighbors. Since there are $\ell\, 2^{\ell-1}$ distinct (unordered) pairs of genotypic neighbors, we have the relationship:
\[ p_r = \frac{d_m}{\ell\, 2^{\ell-1}} + 1 \]
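A small numeric check of this relationship, under the assumption that Rothlauf’s sum counts each unordered pair of neighboring genotypes once (the linear relation $p_r = d_m / (\ell\, 2^{\ell-1}) + 1$ follows from that reading). Representations are tuples indexed by the integer value of the bitstring; the function names are ours.

```python
from fractions import Fraction
import random

def point_locality(rep, n_bits):
    """Mean phenotypic change over all (bitstring, bit position) choices."""
    total = sum(abs(rep[s] - rep[s ^ (1 << i)])
                for s in range(len(rep)) for i in range(n_bits))
    return Fraction(total, n_bits * len(rep))

def rothlauf_dm(rep, n_bits):
    """Sum of (phenotypic distance - 1) over unordered neighbor pairs."""
    return sum(abs(abs(rep[s] - rep[s ^ (1 << i)]) - 1)
               for s in range(len(rep)) for i in range(n_bits)
               if s < (s ^ (1 << i)))

n_bits = 3
sb = tuple(range(8))
brg = (0, 1, 3, 2, 7, 6, 4, 5)
shuffled = list(range(8))
random.Random(1).shuffle(shuffled)   # an arbitrary third representation

for rep in (sb, brg, shuffled):
    p, dm = point_locality(rep, n_bits), rothlauf_dm(rep, n_bits)
    print(p, dm, p == Fraction(dm, n_bits * 2 ** (n_bits - 1)) + 1)
```

For three bits, both SB and binary reflected Gray give $d_m = 16$ and a point locality of $7/3$.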
We also develop an equivalent definition to Rothlauf’s distance distortion, called general locality, that is tailored to our domain. We define the general locality for a representation as the mean difference between phenotypic and genotypic distances between each unique pair of individuals. More formally:
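\[ g_r = \frac{1}{|S|} \sum_{(s,t) \in S} \left| d^p(s,t) - d^g(s,t) \right| \]
where $S$ is the set of all unordered pairs from $\{0,1\}^\ell$ (so $|S| = \binom{2^\ell}{2}$), $d^p(s,t)$ is the phenotypic distance $|r(s) - r(t)|$, and $d^g(s,t)$ is the genotypic (Hamming) distance between $s$ and $t$. Note that our definition of general locality mirrors Chiam’s remoteness preservation and is also equivalent to Rothlauf’s metric, since $\frac{1}{|S|}$ is identical to $\frac{2}{n_p(n_p-1)}$ for $n_p = 2^\ell$.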
It is worth repeating previous observations that high values of $d_m$, $d_c$, $p_r$, or $g_r$ actually denote low locality while low values denote high locality [galvan11:defining]. To avoid confusion, we will refer to low metric values as strong locality and high metric values as weak locality.
3.2 Single-Bit Mutation
Our analysis of single-bit mutation proves lower and upper bounds on $p_r$, computes $p_r$ for SB and Gray representations, and verifies the existence of representations with maximum $p_r$.
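Theorem 1 (Lower bound). $p_r \ge \frac{2^\ell - 1}{\ell}$ for any representation $r$ on $\ell$ bits.
We reduce the problem of minimizing locality to another problem, that of enumerating nodes on a hypercube while minimizing neighbor distances, for which lower and upper bounds have already been established by Harper [harper64:optimal]. We use the term $\Delta_{uv}$ to denote the absolute difference between the numbers assigned to two adjacent vertices $u$ and $v$ on the unit $\ell$-cube. The unit $\ell$-cube consists of all elements in $\{0,1\}^\ell$. Two vertices in the $\ell$-cube are adjacent if they differ by only one bit (i.e., they have Hamming distance 1). Note that assigning a number to a vertex can be thought of as a representation mapping $\{0,1\}^\ell$ to $[0, 2^\ell - 1]$, or $r$. Therefore we have $\Delta_{uv} = |r(u) - r(v)|$ for adjacent vertices $u$ and $v$, since a 1-bit difference is equivalent to a single bit-flip mutation. Furthermore, Harper defines the sum $\sum \Delta_{uv}$ to be the sum of the absolute difference between two adjacent vertices $u$ and $v$ that runs over all possible pairs of neighboring vertices in the $\ell$-cube. Note that
\[ 2 \sum \Delta_{uv} = \sum_{s \in \{0,1\}^\ell} \sum_{i=0}^{\ell-1} \left| r(s) - r(\mathrm{flip}_i(s)) \right| \]
since the RHS counts each $\Delta_{uv}$ twice, once for each ordering of the pair.
Harper proved that $\sum \Delta_{uv} \ge 2^{\ell-1}(2^\ell - 1)$. Therefore,
\[ p_r = \frac{2 \sum \Delta_{uv}}{\ell\, 2^\ell} \ge \frac{2 \cdot 2^{\ell-1}(2^\ell - 1)}{\ell\, 2^\ell} = \frac{2^\ell - 1}{\ell} \]
proving that $p_r \ge \frac{2^\ell - 1}{\ell}$. ∎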
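For very small bit lengths the bound can be checked exhaustively. The following sketch enumerates every representation on 2 and 3 bits (24 and 40320 permutations) and compares the smallest point locality found with the closed form $(2^\ell - 1)/\ell$, which is our reading of the bound. It is a brute-force illustration only and does not scale beyond a few bits.

```python
from fractions import Fraction
from itertools import permutations

def ordered_neighbor_sum(rep, n_bits):
    """Sum of |r(s) - r(flip_i(s))| over all bitstrings s and bit positions i."""
    return sum(abs(rep[s] - rep[s ^ (1 << i)])
               for s in range(len(rep)) for i in range(n_bits))

def min_point_locality(n_bits):
    """Smallest point locality over all nonredundant representations."""
    size = 1 << n_bits
    best = min(ordered_neighbor_sum(p, n_bits) for p in permutations(range(size)))
    return Fraction(best, n_bits * size)

for n_bits in (2, 3):
    print(n_bits, min_point_locality(n_bits), Fraction(2 ** n_bits - 1, n_bits))
```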
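Theorem 2. Standard binary encoding is optimal, meaning that it has the strongest point locality $\frac{2^\ell - 1}{\ell}$.
SB encoding is also a representation; call it $b$. We consider $p_b$:
\[ p_b = \frac{1}{\ell\, 2^\ell} \sum_{s \in \{0,1\}^\ell} \sum_{i=0}^{\ell-1} \left| b(s) - b(\mathrm{flip}_i(s)) \right| \]
The inner sum computes the sum of all differences obtained from flipping the $i$th bit of a given SB string $s$. Flipping the $i$th bit elicits an absolute phenotypic difference of $2^i$ for any $s$ and $i$, reducing the inner sum to:
\[ \sum_{i=0}^{\ell-1} 2^i = 2^\ell - 1 \]
since it is a geometric series with common ratio 2. Now since there are $2^\ell$ elements in $\{0,1\}^\ell$, the outer sum reduces to $2^\ell (2^\ell - 1)$. Combining these lets us compute $p_b$:
\[ p_b = \frac{2^\ell (2^\ell - 1)}{\ell\, 2^\ell} = \frac{2^\ell - 1}{\ell} \]
which is the lower bound given by Theorem 1. Thus, SB has optimal point locality. ∎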
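As a worked instance of this computation (our own illustration, for three bits): flipping bit 0, 1, or 2 of any SB string changes its value by 1, 2, or 4, so each of the 8 strings contributes $1 + 2 + 4 = 7$ to the double sum, and the point locality of SB is

\[ \frac{1}{3 \cdot 2^3} \sum_{s \in \{0,1\}^3} \left( 2^0 + 2^1 + 2^2 \right) = \frac{8 \cdot 7}{24} = \frac{7}{3} \approx 2.33 \]

which matches $(2^\ell - 1)/\ell$ at $\ell = 3$.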
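Theorem 3. Binary Reflected Gray (BRG) encoding is also optimal.
Let $\gamma_\ell$ denote the representation for $\ell$-bit Binary Reflected Gray and
\[ D_\ell = \sum_{s \in \{0,1\}^\ell} \sum_{i=0}^{\ell-1} \left| \gamma_\ell(s) - \gamma_\ell(\mathrm{flip}_i(s)) \right| \]
so that the point locality of BRG is $D_\ell / (\ell\, 2^\ell)$.
We start by proving the following two lemmas:
Lemma 4. $\sum_{s \in \{0,1\}^\ell} \left| \gamma_\ell(s) - \gamma_\ell(\mathrm{flip}_{\ell-1}(s)) \right| = 2^{2\ell-1}$. In other words, the sum of the differences obtained by flipping the leftmost bit over all $\ell$-bit bitstrings in BRG encoding is $2^{2\ell-1}$.
Consider the recursive nature of BRG codes [rowe04:properties]. Let $L_\ell$ be the ordered list of $\ell$-bit BRG codes where $L_\ell[i]$ is the bitstring that maps to $i$. Note that $2^\ell$ is the length of $L_\ell$ and $[\cdot]$ denotes list indexing. The left half of $L_\ell$ contains $L_{\ell-1}$ prefixed with 0 and the right half of $L_\ell$ contains $L_{\ell-1}$ in reverse order prefixed with 1. Flipping the leftmost bit of $L_\ell[i]$ will yield $L_\ell[2^\ell - 1 - i]$. Thus we have
\[ \sum_{s \in \{0,1\}^\ell} \left| \gamma_\ell(s) - \gamma_\ell(\mathrm{flip}_{\ell-1}(s)) \right| = \sum_{i=0}^{2^\ell - 1} \left| i - (2^\ell - 1 - i) \right| = \sum_{i=0}^{2^\ell - 1} \left| 2i - 2^\ell + 1 \right| \]
Note that for a given $i$,
\[ \left| 2i - 2^\ell + 1 \right| = \begin{cases} 2^\ell - 1 - 2i & \text{if } i < 2^{\ell-1} \\ 2i - 2^\ell + 1 & \text{if } i \ge 2^{\ell-1} \end{cases} \]
which lets us split the sum into
\[ \sum_{i=0}^{2^{\ell-1} - 1} \left( 2^\ell - 1 - 2i \right) + \sum_{i=2^{\ell-1}}^{2^\ell - 1} \left( 2i - 2^\ell + 1 \right) \]
Each of the two sums runs over the first $2^{\ell-1}$ odd numbers. Using the fact that the sum of the first $n$ odd numbers is $n^2$,
\[ \sum_{i=0}^{2^\ell - 1} \left| 2i - 2^\ell + 1 \right| = 2 \left( 2^{\ell-1} \right)^2 = 2^{2\ell-1} \]
Thus $\sum_{s} \left| \gamma_\ell(s) - \gamma_\ell(\mathrm{flip}_{\ell-1}(s)) \right| = 2^{2\ell-1}$. ∎
Lemma 5. $D_\ell = 2^\ell (2^\ell - 1)$ for all $\ell \ge 1$.
We proceed with induction on $\ell$. For the base case ($\ell = 1$), the set $\{0,1\}^1$ contains two BRG codes, 0 and 1, which correspond to the integers 0 and 1, respectively. Thus $D_1 = 2 = 2^1 (2^1 - 1)$.
For the inductive hypothesis (I.H.), assume $D_\ell = 2^\ell (2^\ell - 1)$ for some $\ell \ge 1$. We must now show that $D_{\ell+1} = 2^{\ell+1} (2^{\ell+1} - 1)$. Note that in the inductive step, we are working with strings of length $\ell + 1$.
By I.H. and the fact that there are two copies of the $\ell$-bit BRG code in the $(\ell+1)$-bit BRG code, the $\ell$ rightmost bits contribute
\[ \sum_{s \in \{0,1\}^{\ell+1}} \sum_{i=0}^{\ell-1} \left| \gamma_{\ell+1}(s) - \gamma_{\ell+1}(\mathrm{flip}_i(s)) \right| = 2 D_\ell = 2^{\ell+1} (2^\ell - 1) \]
By Lemma 4, the leftmost bit contributes $2^{2(\ell+1)-1} = 2^{2\ell+1}$, so
\[ D_{\ell+1} = 2^{\ell+1} (2^\ell - 1) + 2^{2\ell+1} = 2^{\ell+1} (2^{\ell+1} - 1) \]
which completes the induction. Consequently the point locality of BRG is $D_\ell / (\ell\, 2^\ell) = \frac{2^\ell - 1}{\ell}$, the lower bound of Theorem 1. ∎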
Note that this equivalence in locality has already been demonstrated empirically for small values of $\ell$ [chiam06:issues], but this proof holds for all values.
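The two sums used in the BRG argument can also be checked numerically. This sketch builds the BRG list by reflection and compares the leftmost-bit sum with $2^{2\ell-1}$ and the total over all single-bit flips with $2^\ell (2^\ell - 1)$; both closed forms are our reconstruction of the quantities in the proof, and the helper names are ours.

```python
def brg_list(n_bits):
    """Ordered list of n-bit BRG codes (as ints); index i holds the code for integer i."""
    codes = [0]
    for k in range(n_bits):
        # Reflect the current list and prefix the reflected half with a 1 bit.
        codes = codes + [(1 << k) | c for c in reversed(codes)]
    return codes

def brg_sums(n_bits):
    """Return (sum over leftmost-bit flips, sum over all single-bit flips)."""
    codes = brg_list(n_bits)
    value = {code: i for i, code in enumerate(codes)}
    top = 1 << (n_bits - 1)
    leftmost = sum(abs(value[s] - value[s ^ top]) for s in codes)
    total = sum(abs(value[s] - value[s ^ (1 << i)])
                for s in codes for i in range(n_bits))
    return leftmost, total

for n_bits in range(1, 11):
    leftmost, total = brg_sums(n_bits)
    print(n_bits, leftmost == 2 ** (2 * n_bits - 1),
          total == 2 ** n_bits * (2 ** n_bits - 1))
```

Dividing the total by $\ell\, 2^\ell$ gives the same point locality as SB for every bit length tried.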
Theorem 6. There exists a Gray encoding with suboptimal point locality for any $\ell \ge 3$.
Recall the $\ell$-cube, $\{0,1\}^\ell$, that contains all $\ell$-bit bitstrings. A bitstring has $\ell$ neighbors that are all Hamming distance one away. We can thus reduce the problem of constructing a Gray code to constructing a Hamiltonian path on the $\ell$-cube. Recall in Theorem 1 that we mapped the problem of minimizing locality to that of minimizing $\sum \Delta_{uv}$. In the same paper, Harper formulates an algorithm that provably generates all representations that minimize $\sum \Delta_{uv}$, as follows:
Assign 0 to a random vertex.
For $k$ from 1 to $2^\ell - 1$, assign $k$ to the unlabeled vertex with the highest number of already labeled neighbors. If there are multiple vertices with this property, choose one at random.
Our goal is to construct a Hamiltonian path that violates this algorithm. This path in turn will determine a Gray code that has suboptimal point locality, because Harper’s algorithm generates all representations with optimal point locality.
Our modified algorithm starts by assigning 0 to vertex $0^\ell$, where the notation $0^\ell$ denotes a length-$\ell$ bitstring of all 0s. We then assign 1 to $0^{\ell-1}1$ and assign 2 to $0^{\ell-2}11$. The above algorithm would force us to assign 3 to $0^{\ell-2}10$ if we wanted to produce an optimal Gray code, because that is the only unlabeled vertex with two labeled neighbors. Instead, we assign 3 to $0^{\ell-3}111$, which violates the algorithm. The remainder of the Hamiltonian path can be traversed arbitrarily. This construction generates a Gray code with $p_r > \frac{2^\ell - 1}{\ell}$ for $\ell \ge 3$. ∎
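The greedy labeling and a violating Gray code can both be sketched in a few lines. This is an illustrative implementation under our own assumptions: vertices are integers whose bits are the cube coordinates, ties are broken with a seeded random generator, and the 3-bit path `path` is one completion we chose by hand, not necessarily the authors’ exact construction.

```python
import random

def harper_min_labeling(n_bits, rng):
    """Greedy labeling: each new integer goes to a vertex with the most labeled neighbors."""
    size = 1 << n_bits
    label = {rng.randrange(size): 0}          # vertex -> assigned integer
    for k in range(1, size):
        best, best_count = [], -1
        for v in range(size):
            if v in label:
                continue
            count = sum((v ^ (1 << i)) in label for i in range(n_bits))
            if count > best_count:
                best, best_count = [v], count
            elif count == best_count:
                best.append(v)
        label[rng.choice(best)] = k           # random tie-breaking
    return label

def edge_sum(label, n_bits):
    """Sum of |label difference| over every edge of the cube, counted once."""
    return sum(abs(label[v] - label[v ^ (1 << i)])
               for v in label for i in range(n_bits) if v < (v ^ (1 << i)))

# One Gray code (our choice of completion) that breaks the greedy rule at step 3.
path = [0b000, 0b001, 0b011, 0b111, 0b101, 0b100, 0b110, 0b010]
bad_gray = {v: k for k, v in enumerate(path)}

optimum = 2 ** (3 - 1) * (2 ** 3 - 1)         # minimum edge sum for 3 bits: 28
print(edge_sum(harper_min_labeling(3, random.Random(0)), 3), optimum)
print(edge_sum(bad_gray, 3))                  # 30, i.e. p = 60/24 = 2.5 > 7/3
```

Consecutive entries of `path` differ in exactly one bit, so it is a valid Gray code, yet its edge sum exceeds the minimum.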
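Theorem 7 (Upper bound). $p_r \le 2^{\ell-1}$ for any representation $r$ on $\ell$ bits.
We rely on another result from Harper, proving that $\sum \Delta_{uv} \le \ell\, 2^{2\ell-2}$. We have
\[ p_r = \frac{2 \sum \Delta_{uv}}{\ell\, 2^\ell} \le \frac{2 \cdot \ell\, 2^{2\ell-2}}{\ell\, 2^\ell} = 2^{\ell-1} \]
Thus $p_r \le 2^{\ell-1}$. ∎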
Theorem 8. There exists a representation with upper bound point locality $p_r = 2^{\ell-1}$.
In this proof, we reduce the problem of constructing a representation with upper bound locality to that of assigning integers in $[0, 2^\ell - 1]$ to vertices in the $\ell$-cube such that $\sum \Delta_{uv}$ is maximized. Harper also demonstrates an algorithm assigning numbers to vertices that provably maximizes $\sum \Delta_{uv}$, as follows:
Assign 0 to a random vertex. Let $w$ be the number of ‘1’ bits in that vertex.
Assign the integers $1, \ldots, 2^{\ell-1} - 1$ randomly to vertices (bitstrings) whose number of constituent 1s has the same parity as $w$.
Assign the remaining numbers to the leftover vertices at random. These are the vertices whose number of constituent 1s has the opposite parity to $w$.
Maximizing $\sum \Delta_{uv}$ is equivalent to maximizing $p_r$, so such a representation exists. ∎
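The parity-based assignment above is straightforward to sketch. The code below is an illustration with our own function names and a seeded random generator. It places the lower half of the integers on vertices of one parity and the upper half on the other, then measures the resulting point locality, which should equal $2^{\ell-1}$ if the upper bound is as we read it.

```python
from fractions import Fraction
import random

def harper_max_labeling(n_bits, rng):
    """Assign 0 .. 2**(n_bits-1) - 1 to one parity class, the rest to the other."""
    size = 1 << n_bits
    start = rng.randrange(size)
    parity = bin(start).count("1") % 2
    same = [v for v in range(size) if v != start and bin(v).count("1") % 2 == parity]
    other = [v for v in range(size) if bin(v).count("1") % 2 != parity]
    rng.shuffle(same)
    rng.shuffle(other)
    return {v: k for k, v in enumerate([start] + same + other)}

def point_locality(label, n_bits):
    """Mean |label change| over all (vertex, bit position) choices."""
    total = sum(abs(label[v] - label[v ^ (1 << i)])
                for v in label for i in range(n_bits))
    return Fraction(total, n_bits * len(label))

for n_bits in range(1, 7):
    print(n_bits, point_locality(harper_max_labeling(n_bits, random.Random(n_bits)), n_bits))
```

Every edge of the cube joins vertices of opposite parity, so every single-bit flip crosses between the lower and upper halves of the integer range.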
Having explored the properties of point locality for single-bit mutations, we now turn our attention to general locality and distance distortion for any variation operator.
3.3 General Locality
We start our analysis of general locality by proving a lower bound on its value, and continue by proving the asymptotic equivalence of all nonredundant binary integer representations under this metric.
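Theorem 9. For any representation $r$ on $\ell$ bits, $g_r \ge \frac{2^\ell + 1}{3} - \frac{\ell\, 2^{\ell-1}}{2^\ell - 1}$.
By definition, $g_r = \frac{1}{|S|} \sum_{(s,t) \in S} \left| d^p(s,t) - d^g(s,t) \right|$. By the triangle inequality, we have
\[ g_r \ge \frac{1}{|S|} \left| \sum_{(s,t) \in S} d^p(s,t) - \sum_{(s,t) \in S} d^g(s,t) \right| = \frac{1}{|S|} \left| P - G \right| \]
where we let $P = \sum_{(s,t) \in S} d^p(s,t)$ and $G = \sum_{(s,t) \in S} d^g(s,t)$. Since $P$ only deals with phenotypes in $[0, 2^\ell - 1]$, it is equivalent to
\[ P = \sum_{i=0}^{2^\ell - 2} \sum_{j=i+1}^{2^\ell - 1} (j - i) \]
Let the outer sum fix $i$. Then the inner sum computes the sum of numbers from $1$ to $2^\ell - 1 - i$. This reduces to
\[ P = \sum_{i=0}^{N-2} \frac{(N - 1 - i)(N - i)}{2} = \sum_{k=1}^{N-1} \frac{k(k+1)}{2} \]
where we let $N = 2^\ell$ for simplicity. Using the facts that the sum of the first $n$ natural numbers is $\frac{n(n+1)}{2}$ and the sum of the first $n$ squares is $\frac{n(n+1)(2n+1)}{6}$, we have
\[ P = \frac{1}{2} \left( \frac{(N-1)N(2N-1)}{6} + \frac{(N-1)N}{2} \right) = \frac{(N-1)N(N+1)}{6} \]
Substituting $N = 2^\ell$ back into the equation yields:
\[ P = \frac{2^\ell (2^\ell - 1)(2^\ell + 1)}{6} \]
Since $G$ is the sum of the Hamming distances between all unique pairs of bitstrings, it is equivalent to
\[ G = \frac{1}{2} \cdot 2^\ell \sum_{k=1}^{\ell} k \binom{\ell}{k} \]
because for each of the $2^\ell$ bitstrings, a bitstring has $\binom{\ell}{k}$ other bitstrings with Hamming distance $k$ (choose $k$ of the $\ell$ bits to be flipped). We divide by two because we count each pair twice. We can simplify $G$ using $\sum_{k=1}^{\ell} k \binom{\ell}{k} = \ell\, 2^{\ell-1}$, obtaining
\[ G = \ell\, 2^{2\ell-2} \]
Substituting $P$ and $G$ back into the inequality, with $|S| = 2^{\ell-1}(2^\ell - 1)$, yields
\[ g_r \ge \frac{P - G}{|S|} = \frac{2^\ell + 1}{3} - \frac{\ell\, 2^{\ell-1}}{2^\ell - 1} \]
∎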
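The lower bound can be compared against measured general locality for small bit lengths. The sketch below assumes the closed form $\frac{2^\ell + 1}{3} - \frac{\ell\, 2^{\ell-1}}{2^\ell - 1}$, which is our reading of the derivation above, and uses hypothetical helper names. It evaluates SB and binary reflected Gray for 1 to 7 bits and exhaustively searches all 2-bit representations.

```python
from fractions import Fraction
from itertools import combinations, permutations

def general_locality(rep):
    """Mean |d^p - d^g| over unordered pairs; rep[s] is the phenotype of bitstring s."""
    pairs = list(combinations(range(len(rep)), 2))
    total = sum(abs(abs(rep[s] - rep[t]) - bin(s ^ t).count("1")) for s, t in pairs)
    return Fraction(total, len(pairs))

def lower_bound(n_bits):
    """Mean phenotypic distance minus mean Hamming distance over all pairs."""
    n = 2 ** n_bits
    return Fraction(n + 1, 3) - Fraction(n_bits * 2 ** (n_bits - 1), n - 1)

def brg_tuple(n_bits):
    """Binary reflected Gray: entry s is the integer encoded by bitstring s."""
    def decode(s):
        x = 0
        while s:
            x ^= s
            s >>= 1
        return x
    return tuple(decode(s) for s in range(2 ** n_bits))

for n_bits in range(1, 8):
    sb = tuple(range(2 ** n_bits))
    print(n_bits, float(lower_bound(n_bits)),
          float(general_locality(sb)), float(general_locality(brg_tuple(n_bits))))

best_2bit = min(general_locality(p) for p in permutations(range(4)))
print(best_2bit, lower_bound(2))   # 1/3 1/3
```

For 2 bits the bound is attained, for instance by binary reflected Gray.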
Theorem 10. $g_r \in \Theta(2^\ell)$ for any representation $r$ on $\ell$ bits. That is, for a fixed $\ell$, $g_r$ is also asymptotically constant irrespective of $r$.
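The key intuition behind this proof is that for nonredundant binary-integer representations, as $\ell$ grows, the phenotypic distances grow at an asymptotically greater rate than the genotypic distances. We can separate the phenotypic distances from the genotypic distances by partitioning $S$ into two sets $S = S_p \sqcup S_g$, where $\sqcup$ denotes disjoint union:
\[ S_p = \{ (s,t) \in S : d^p(s,t) > d^g(s,t) \}, \qquad S_g = \{ (s,t) \in S : d^p(s,t) \le d^g(s,t) \} \]
In other words, $S_p$ contains all the pairs in $S$ where the two bitstrings have greater phenotypic (Euclidean) distance than genotypic (Hamming) distance, and $S_g$ contains all pairs where the two bitstrings have greater or equal genotypic distance than phenotypic distance. We can rewrite $g_r$ as (letting $d^p = d^p(s,t)$ and $d^g = d^g(s,t)$)
\[ g_r = \frac{1}{|S|} \left( \sum_{S_p} (d^p - d^g) + \sum_{S_g} (d^g - d^p) \right) = \frac{\sigma_p + \sigma_g}{|S|} \]
where we let $\sigma_p = \sum_{S_p} (d^p - d^g)$ and $\sigma_g = \sum_{S_g} (d^g - d^p)$.
Note that $|S_g| \le \frac{2^\ell \cdot 2\ell}{2} = \ell\, 2^\ell$ since each of the $2^\ell$ bitstrings can have at most $2\ell$ bitstrings for which their genotypic distance is greater or equal to their phenotypic distance (the genotypic distance is at most $\ell$, and only $2\ell$ phenotypes lie within distance $\ell$ of a given phenotype), and we divide by two since we count each pair twice; thus $|S_g| \in O(\ell\, 2^\ell)$. We can now say $|S_p| = |S| - |S_g| \ge \binom{2^\ell}{2} - \ell\, 2^\ell$, so $|S_p| \in \Theta(2^{2\ell})$. Consider
\[ \lim_{\ell \to \infty} \frac{|S_g|}{|S_p|} = 0 \]
since $2^{2\ell}$ grows faster than $\ell\, 2^\ell$. Thus $|S_p|$ dominates $|S_g|$, and $|S_p| \sim |S|$.
We can now perform a similar analysis for $\sigma_g$ and $\sigma_p$. Note that $\sigma_g \le \ell\, |S_g|$ since any pair in $S_g$ can have a maximum $d^g - d^p$ of $\ell$. Thus $\sigma_g \in O(\ell^2\, 2^\ell)$. Also note that $\sigma_p \ge |S_p|$ since any pair in $S_p$ can have a minimum $d^p - d^g$ of 1. Thus $\sigma_p \in \Omega(2^{2\ell})$. Consider
\[ \lim_{\ell \to \infty} \frac{\sigma_g}{\sigma_p} = 0 \]
since $2^{2\ell}$ grows faster than $\ell^2\, 2^\ell$, and so $\sigma_p + \sigma_g \sim \sigma_p$. Now we can make a statement about $g_r$. We have
\[ g_r = \frac{\sigma_p + \sigma_g}{|S|} \sim \frac{\sigma_p}{|S|} = \frac{1}{|S|} \sum_{S_p} (d^p - d^g) \]
Thus $g_r \sim \frac{1}{|S|} \sum_{S_p} (d^p - d^g)$. Since $|S_p| \sim |S|$ and $d^p$ dominates $d^g$, we replace $S_p$ with $S$ in the summation to obtain (recall $|S| = \binom{2^\ell}{2}$)
\[ g_r \sim \frac{1}{|S|} \sum_{S} d^p = \frac{2^\ell + 1}{3} \]
which was found in the proof of Theorem 9. Thus $g_r \sim \frac{2^\ell + 1}{3}$ and $g_r \in \Theta(2^\ell)$.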