# Simple Genetic Operators are Universal Approximators of Probability Distributions (and other Advantages of Expressive Encodings)

This paper characterizes the inherent power of evolutionary algorithms. This power depends on the computational properties of the genetic encoding. With some encodings, two parents recombined with a simple crossover operator can sample from an arbitrary distribution of child phenotypes. Such encodings are termed expressive encodings in this paper. Universal function approximators, including popular evolutionary substrates of genetic programming and neural networks, can be used to construct expressive encodings. Remarkably, this approach need not be applied only to domains where the phenotype is a function: Expressivity can be achieved even when optimizing static structures, such as binary vectors. Such simpler settings make it possible to characterize expressive encodings theoretically: Across a variety of test problems, expressive encodings are shown to achieve up to super-exponential convergence speed-ups over the standard direct encoding. The conclusion is that, across evolutionary computation areas as diverse as genetic programming, neuroevolution, genetic algorithms, and theory, expressive encodings can be a key to understanding and realizing the full power of evolution.


## 1. Introduction

Evolutionary algorithms (EAs) promise to generate powerful solutions in complex environments. To fulfill this promise, not only must great solutions exist somewhere in their search space, but EAs must be able to find them with non-vanishing probability even in high-dimensional, dynamic, and deceptive domains. One way to deliver on this promise is to design more and more specialized operators to capture the structure of a particular domain. Although this approach has led to many practical successes (chicano2017optimizing; deb2016breaking; goldberg1994genetic; whitley2019next), it is inherently limited by the human effort required to design such operators for each domain: That is, humans remain a bottleneck.

This paper advocates for an alternative approach: using simple, generic evolutionary operators with general genotype encodings in which arbitrarily high levels of complexity can accumulate. Such encodings are termed expressive encodings in this paper, and are analyzed theoretically and experimentally. Using standard simple genetic operators (SGOs) involving only one or two parents, expressive encodings are capable of universal approximation of child phenotype distributions. Thus, in theory, no hand-design is necessary to capture complex domain-adapted behavior.

Expressive encodings can be implemented with universal function approximators such as genetic programming and neural networks; the paper shows that such systems can achieve up to super-exponential improvements in settings that are intractable with standard direct encodings. The space of expressive encodings overlaps with indirect encodings, but is distinctly different: Some direct encodings can be expressive as well.

The expressive encoding approach can be seen as analogous to recent progress in machine learning: Prior to deep learning (goodfellow2016deep; lecun2015deep), progress was made by crafting better features and increasingly complex learning algorithms to solve different kinds of problems. With deep learning, large, complex structures with a general form (deep networks) allowed better solutions to be found across huge swaths of problems using a generic and simple algorithm: stochastic gradient descent (SGD). EAs could make similar progress via complex, general encodings evolved with SGOs.

The approach creates further opportunities in at least four areas of evolutionary computation:

• Genetic Programming and Neuroevolution: Expressive encodings in these frameworks can be developed further and used as a starting point for improved methods.

• Genetic Algorithms: They can be defined for GAs in general, increasing their power.

• Theory: They can serve as a foundation for a new theoretical perspective on EAs.

In other words, not only are expressive encodings powerful and valuable in their own right, they can provide a shared platform of research that would allow fruitful communication and knowledge-sharing between often disparate subfields.

## 2. Conceptual Framework

This section provides background on encodings and genetic operators, including running examples that will be used in the paper.

### 2.1. Encodings

An encoding is a function E : G → P, where G is a set of genotypes and P is a set of phenotypes. The genotype (or genome) g ∈ G is a description of the individual, which E uses to produce the phenotype E(g) ∈ P, which can then be evaluated in a given environment. A fitness function f : P → ℝ assigns a score to the phenotype. The evolutionary computation literature is filled with a diversity of encodings. This paper focuses on some of the most popular, and, for simplicity, primarily uses a boolean vector phenotype space P = {0, 1}^n, but the results can be extended to other settings.

#### 2.1.1. Direct Encoding

This is the simplest and most popular approach in using GAs and EAs for optimization. With a direct encoding, E is the identity function E(g) = g, and G = P. It is called direct because the algorithm directly evolves elements of the phenotype: the genotype contains no information or structure not in the phenotype, and vice versa. When evolving boolean vectors with a direct encoding, the genotype and phenotype are simply a vector of bits, which is the most common setting in EA theory (auger2011theory; doerr2015black; doerr2012multiplicative; watson2007building; witt2013tight).

#### 2.1.2. Neural Networks

Evolving neural networks (NNs), or neuroevolution (NE), is a popular approach to discovering solutions for problems like function approximation, control, and sequential decision-making (fogel1990evolving; miikkulainen2019evolving; stanley2002evolving; stanley2019designing; such2017deep; yao1999evolving). The phenotype is usually a function, which may receive multiple distinct inputs during its evaluation. Many encodings exist for evolving NNs. The simplest of them is a direct encoding, where the NN structure is fixed and its weights are evolved (fogel1990evolving; such2017deep). However, NNs can also be used as the genetic encoding in cases where the phenotype is not a function but simply a fixed structure, such as a binary vector. In this case, since there is no varying input to the phenotype during evaluation, a fixed value, e.g., 1, can simply be fed into the network to generate the fixed phenotype. Formally, an NN genotype is a feed-forward network h : ℝ → ℝ^n, and a sigmoid activation σ in the final layer squashes the output of the NN into (0, 1)^n. Therefore, the overall encoding is

 (1) E(h)=round(σ(h(1))),

where ‘round’ is applied elementwise, so that E(h) produces a binary vector. In this paper, all internal nodes have sigmoid activation, and all biases are 0 unless otherwise noted.
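As a minimal concrete sketch of Eq. (1), the encoding can be implemented as follows. The network shape and weight values below are arbitrary toy choices, not taken from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nn_encode(weights):
    """E(h) = round(sigmoid(h(1))): feed the constant input 1 through a
    feed-forward net (sigmoid activations, zero biases) and round the
    final layer's output elementwise to get a binary phenotype."""
    a = np.array([1.0])           # fixed input, since the phenotype is static
    for W in weights:
        a = sigmoid(W @ a)        # every node uses sigmoid activation
    return np.round(a).astype(int)

# Toy two-layer genotype producing a 4-bit phenotype (arbitrary weights).
h = [np.array([[2.0], [-2.0]]),              # 1 input -> 2 hidden units
     np.array([[3.0, 0.0], [0.0, 3.0],
               [-3.0, 0.0], [0.0, -3.0]])]   # 2 hidden -> 4 outputs
print(nn_encode(h))  # [1 1 0 0]
```

A bit of the phenotype is 1 exactly when its final-layer pre-activation is positive, so small changes to weights can induce structured, coordinated changes in the phenotype.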

Although we are not aware of any prior work that has evolved NNs to optimize binary vectors, there is substantial prior work evolving NNs to generate static artifacts such as pictures (krolikowski2020quantum; liapis2021transforming; secretan2008picbreeder) and 3D objects (clune2011evolving; lehman2016creative). Similarly to this paper, the motivation is that the structure of NNs can lead to interesting patterns in how offspring phenotypes relate to parent phenotypes. High-level (and often interpretable (huizinga2018emergence)) reproductive complexity can be achieved that would not be possible if, say, pixels or voxels were evolved directly. The complexity of what is possible in evolution at any point in time is accumulated in what has been evolved so far. This area of work has achieved impressive, visually appealing results (lehman2012beyond; nguyen2015innovation). A goal of the present work is to show that such an approach is indeed fundamentally powerful for problem-solving, and overcomes serious limitations of direct encodings.

#### 2.1.3. Genetic Programming

Like neuroevolution, genetic programming (GP) is used in situations where phenotypes require complex behavior (brameier2007linear; hodjat2014maintenance; koza1992genetic; langdon2013foundations; o2013ec; spector2002genetic; shahrzad2015tackling). GP is often used for function approximation (orzechowski2018we; schmidt2009distilling), but its main motivation is to generate programs that meet a given description. As evolved programs accumulate complexity, like in the case of NNs (huizinga2018emergence; lehman2013evolvability), the behavior of evolution itself changes over time (altenberg1994evolution; banzhaf2014genetic; hu2011robustness). So, like NNs, GP can be used to generate static structures, leading to benefits from the complexity of evolved programs. Just as there are many languages for human programmers, there are many encodings for GP (brameier2007linear; koza1992genetic; spector2002genetic). The encoding can generally be defined by the available terminals and operators. This paper considers a small set of terminals and operators, which can naturally be extended:

• Terminals: binary vectors with evolved values, integers;

• Operators: <, >, par, +, if, return.

The binary vectors found at terminals can be of varying length, but at least one must be of length n so that the program can return a solution of length n. Vectors of length 1 will be broadcast when used in binary operations; ‘<’ and ‘>’ compare vectors from left to right as if they were binary integers; ‘par’ returns the parity of the vector; ‘+’ performs elementwise addition (mod 2). A program using these operators can be rendered in sequential, tree, or graph form (brameier2007linear; koza1992genetic); this paper uses sequential programs for readability.
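A hedged Python rendering of two of these operators (‘par’ and elementwise addition mod 2), together with a small program in the switch style used later in the paper, may help make the operator set concrete. The program itself is illustrative only, not one of the paper's constructions:

```python
def par(v):
    """Parity: 1 if the binary vector has an odd number of ones, else 0."""
    return sum(v) % 2

def xor_add(a, b):
    """Elementwise addition mod 2; a length-1 vector is broadcast."""
    if len(a) == 1:
        a = a * len(b)
    if len(b) == 1:
        b = b * len(a)
    return [(x + y) % 2 for x, y in zip(a, b)]

def program(c, y1, y2):
    """A sequential program: if par(c): return y1; return y2.
    The evolvable control vector c switches which stored vector is output."""
    if par(c):
        return y1
    return y2

print(program([1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]))  # [1, 1, 1, 1]
```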

Of course, NNs and programs on their own are expressive in the space of functions: NNs can approximate any function arbitrarily closely (cybenko1989approximation; hornik1989multilayer; kolmogorov1957representation), and sufficiently powerful programming languages can compute anything that is computable (church1936unsolvable; spector2002genetic; turing1936computable). This paper focuses on a different type of expressivity: How do different encodings enable the evolutionary process to behave in complex and powerful ways? This behavior involves pairing encodings with genetic operators, which are discussed next.

### 2.2. Genetic Operators

Genetic (or evolutionary) operators are the mechanisms by which genomes reproduce to generate other genomes. A genetic operator is a (usually stochastic) function that produces a new genome given a set of parent genomes, so applying it results in a distribution over child genomes. This paper focuses on the two most common operators (in both nature and computation): crossover and mutation. For consistency, assume the genotype can be flattened into a string of symbols in a canonical way, so that g[i] refers to the ith symbol in the string form of a genotype g. The following operators are likely familiar to EA practitioners, but they are briefly described here for completeness.

#### 2.2.1. Uniform Crossover

This operator takes two parents of equivalent structure and produces a child by independently selecting, for each element, the value from one of the two parents uniformly at random. That is, for each index i, the child's symbol is copied from parent 1 or parent 2 with probability 1/2 each. Importantly, if the two parents have the same value at index i, then the child is also guaranteed to have that value at i. The results on uniform crossover in this paper should be extendable to other forms of crossover, including single- and multi-point crossover (de1992formal; mitchell1991royal; watson2007building).
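A minimal sketch of the operator, using Python lists as the flattened genotypes:

```python
import random

def uniform_crossover(g1, g2, rng=random):
    """Child copies each symbol from parent 1 or parent 2 with equal
    probability, independently per position."""
    assert len(g1) == len(g2)
    return [a if rng.random() < 0.5 else b for a, b in zip(g1, g2)]

# Wherever the parents agree, the child is guaranteed to agree too.
random.seed(0)
p1, p2 = [0, 1, 1, 0], [0, 1, 0, 1]
child = uniform_crossover(p1, p2)
print(child[:2])  # [0, 1] -- always inherited from the shared prefix
```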

#### 2.2.2. Single-point Mutation

With single-point mutation the child is a copy of a single parent with a single location altered. For example, if the encoding is an NN, can alter a single weight in the network, e.g., by adding Gaussian noise. If the encoding is GP, can replace a symbol with a different valid symbol, e.g., flip a bit of a location that contains a binary value. Single-point mutation is similar to uniform mutation, which mutates each element independently with equal probability, but is simpler to analyze, and has been shown to be effective in many benchmarks (doerr2008comparing).
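Both variants described above can be sketched as follows; the Gaussian standard deviation is an arbitrary choice, not a value from the paper:

```python
import random

def single_point_mutation_bits(genome, rng=random):
    """GP-style: copy the parent and flip one uniformly chosen bit."""
    child = list(genome)
    i = rng.randrange(len(child))
    child[i] ^= 1
    return child

def single_point_mutation_weights(weights, rng=random, sigma=1.0):
    """NN-style: copy the parent and add Gaussian noise to one
    uniformly chosen weight."""
    child = list(weights)
    i = rng.randrange(len(child))
    child[i] += rng.gauss(0.0, sigma)
    return child

random.seed(1)
print(single_point_mutation_bits([0, 0, 0, 0]))  # differs in exactly one bit
```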

#### 2.2.3. Simple Genetic Operators (SGOs)

We call a genetic operator simple if both its description length and the cardinality of its parent set are constant (w.r.t. the phenotype dimensionality n). Clearly, the above examples are simple, which matches our intuition, since they are some of the most basic operators in EAs. These operators are also simple in the colloquial sense: They are easy to explain, implement, and apply in a wide variety of settings without invoking too much domain-specific or encoding-specific complexity. Operators that are not simple include model-based EAs such as EDAs (harik1999compact; hauschild2011introduction; pelikan1999boa), linkage-based algorithms (goldman2014parameter; thierens2010linkage), and evolution strategies (hansen2006cma; wierstra2014natural), which construct probabilistic models whose size grows with n from the genotypes, and use these models to generate new candidates. Such models usually have a restricted structure, but in theory could model any phenotype distribution. This paper shows that SGOs are also fundamentally powerful, as long as the encoding is sufficiently complex. SGOs are more in line with how evolution operates in nature, as well as with the original motivation for GAs (holland1992genetic). They also avoid the pitfalls of a central algorithmic bottleneck: The operations of variation in the algorithm can be exceedingly simple, and still result in powerful complex behavior via complexity accumulating in genotypes. This idea is made formal in the next section.

## 3. Expressive Encodings

The behavior of evolutionary algorithms can be described by their transmission function, which probabilistically maps parent phenotypes to child phenotypes (cavalli1976evolution; slatkin1970selection; altenberg1994evolution). EAs that use expressive encodings can, in principle, yield the behavior of any transmission function, making them general and powerful generative systems.

###### Definition 0 (Expressive Encoding).

An encoding E is expressive for a simple genetic operator g if, for any set of phenotypes Y with |Y| = k, any probability density μ over Y, any ε > 0, and any initial set of parent phenotypes, there exists a parent set X_p whose genotypes decode to those parent phenotypes, and

 (2) ∣∣ Pr[E(g(X_p))=y]−μ(y) ∣∣ < ϵ  ∀ y∈Y.

In other words, starting from any initial set of parent phenotypes, a single application of the genetic operator can generate any distribution of child phenotypes. This paper focuses on the case where μ is a discrete distribution, but the ideas can be extended to the continuous case. The complexity of an expressive encoding with a particular genetic operator is the genome size required to achieve the desired level of approximation. Expressive encodings have the satisfying property that the current definition of the evolutionary process must be stored in the genomes themselves. There is no reliance on a large external controller of evolution; the only external algorithmic information is in the SGOs; the process and artifacts of evolution are stored at a single level: in the population.

It may seem that all expressive encodings are indirect encodings (stanley2003taxonomy; stanley2009hypercube; stanley2019designing). However, if the phenotype space is sufficiently expressive, such as in the case of searching for a neural network to perform a task, then a direct encoding may be an expressive encoding. This property is demonstrated for the case of neural networks in the next section (see Theorem 4.6). There are other kinds of systems capable of universal approximation of probability distributions, such as neural generative models (e.g., the generator in a GAN architecture (goodfellow2014generative; lu2020universal)). If EAs were not capable of such universality, then one should be hesitant to use them as a problem-solving tool or engine of a generative system. Fortunately, they are, as is shown next.

## 4. Expressivity of GP, NN, Direct, and Universal Approximator Encodings

This section provides constructions that illustrate the fundamental power of expressive encodings, including showing that NNs and GP are expressive for uniform crossover and one-point mutation, while the direct encoding is not. A key mechanism is representation of switches in the genome. This mechanism is first introduced in the special case of analyzing big jumps in GP, NN, and direct encoding, and the results are then generalized to full universal approximation.

### 4.1. Special Case: Miracle Jumps

There is a common hope in evolutionary computation of stumbling upon ‘big jumps’ that end up being useful. There is a large theory literature on how well algorithms handle ‘jump functions’, i.e., problems where there are local optima from which a jump must be taken to reach the global optimum (bambury2021generalized; dang2016escaping; whitley2018exploration). This subsection shows how expressive encodings can enable reliable jumps of maximal size—a useful feature for EAs in general. We call this behavior a miracle jump because, if it were to occur in a standard evolutionary algorithm with a direct encoding, it would be considered a miracle.

Let P = {0, 1}^n. Given an encoding E and genetic operator g, the goal is to find a parent set X_p such that E(x) = 0^n for all x ∈ X_p and Pr[E(g(X_p)) = 1^n] = Ω(1). In other words: Find parents whose phenotypes are all zeros, but who generate a child of all ones with probability that does not go to zero as n increases. To solve this problem, the encoding must be capable of encoding a kind of switch that allows non-trivial likelihood of flipping to the all-ones state.

#### 4.1.1. Genetic Programming

Consider the two parent programs shown in Figure 1(a). The two parents are equivalent except for the values of two of their terminals. So, uniform crossover results in a constant chance, independent of n, that the child's phenotype is 1^n.

#### 4.1.2. Neural Networks

Consider the two parent NNs shown in Figure 1(b). The two parents are equivalent except for the values of each of their two weights in the first layer. With uniform crossover, there is a constant chance that the child inherits the combination of first-layer weights that results in the phenotype 1^n.

#### 4.1.3. Direct Encoding

Since there are no 1’s in either parent (Figure 1(c)), this solution is impossible for crossover to find. Single-point mutation only changes a single bit, but even uniform mutation fails: It flips each bit i.i.d. with some fixed probability, so the chance of flipping all n bits in a single step vanishes exponentially as n grows.

The GP and NN constructions above are built on the idea of having a large part of the genome being completely shared between the two parents, and then having some auxiliary values that are unshared and function as control bits that have a high-level influence on how the phenotype is generated. This kind of construction is used to generalize these results in the next subsection.
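The switch idea can be simulated abstractly. In this hypothetical re-rendering (not the paper's exact Figure 1 parents), the genome is a pair of unshared control values plus a shared decoding rule; both parents decode to all zeros, yet crossover assembles the all-ones switch with constant probability:

```python
import random

N = 1000  # phenotype dimension; the jump probability does not depend on it

def decode(genome, n=N):
    """Shared rule: the phenotype is all ones iff both control values are 1."""
    c1, c2 = genome
    return [1] * n if (c1 == 1 and c2 == 1) else [0] * n

parent1, parent2 = (1, 0), (0, 1)  # both decode to all zeros
assert decode(parent1) == [0] * N and decode(parent2) == [0] * N

random.seed(0)
hits, trials = 0, 10_000
for _ in range(trials):
    # Uniform crossover: each control value is inherited independently.
    child = tuple(random.choice(pair) for pair in zip(parent1, parent2))
    if decode(child) == [1] * N:
        hits += 1
print(hits / trials)  # ~0.25, regardless of N
```

The jump probability is constant in n because it depends only on the two control values, not on the phenotype dimension.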

### 4.2. General Case: Universal Approximation

This subsection shows that certain encodings are expressive with SGOs. Unless otherwise noted, the phenotype space is that of binary vectors, P = {0, 1}^n. Let y_1, …, y_k be the desired phenotypes in the child distribution, and μ(y_1), …, μ(y_k) be their associated probabilities. Notice that to achieve an approximation error of ε, any phenotypes that have probability less than ε can be ignored, so k ≤ 1/ε can be assumed. While expressivity can be demonstrated with many different constructions, the goal of this section is to provide constructions that are both intuitive and highlight where the power of expressive encodings comes from. Sketches of the proofs are provided below. Detailed proofs are included in Appendix A.

###### Theorem 4.1.

Genetic programming is an expressive encoding for uniform crossover, with complexity .

Proof sketch. Take the parents in Figure 2(a). They differ only in their values of , and their child’s value can be viewed as an integer sampled uniformly from . If , then the ’s can be chosen so that probability is apportioned to each with error less than . ∎
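The proof idea can be simulated concretely. In this hedged sketch (the thresholds and control bits are illustrative stand-ins for the construction in Figure 2(a)), crossover of parents that differ only in m control bits yields a uniform m-bit integer, which shared cumulative thresholds convert into an approximate sample from μ:

```python
import random
from collections import Counter

m = 10                       # control bits: a uniform integer in [0, 2^m)
targets = ["y1", "y2", "y3"]
mu = [0.5, 0.3, 0.2]         # desired child distribution over the targets

# Shared cumulative thresholds apportion the 2^m integers to the targets.
cum, thresholds = 0.0, []
for p in mu:
    cum += p
    thresholds.append(int(cum * 2**m))

# Parent 1 has all-zero control bits and parent 2 all-one control bits,
# so the child's control bits are i.i.d. uniform: a uniform m-bit integer.
def child_phenotype(rng):
    bits = [rng.choice((0, 1)) for _ in range(m)]
    z = int("".join(map(str, bits)), 2)
    for t, y in zip(thresholds, targets):
        if z < t:
            return y
    return targets[-1]

rng = random.Random(0)
counts = Counter(child_phenotype(rng) for _ in range(20_000))
print({y: counts[y] / 20_000 for y in targets})  # close to mu
```

With m control bits, each probability is approximated to within 2^−m, matching the role of ε in the theorem.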

###### Theorem 4.2.

Genetic programming is an expressive encoding for single-point mutation, with complexity .

Proof sketch. Take the parent in Figure 2(b). Let be proportional to , and then scale up all so the chance of a mutation in a is sufficiently small. ∎

Notice that these constructions resemble the Pitts GP model (hodjat2014maintenance; o2013ec; smith1980learning; urbanowicz2009learning). Neither of the above constructions depends on the target phenotypes being drawn from a particular phenotype space. The targets could have any structure and be drawn from any phenotype space, and a similar construction could be provided, highlighting the fact that expressive encodings enable powerful EA behavior that is general across problem domains.

Note also that the construction for crossover is asymptotically more compact than that for mutation in the leading term of its complexity. This result suggests that crossover may have an inherent advantage over mutation: It achieves equivalently complex behavior with a more compact genome. An interesting question is whether this observation is reflected in biological evolution. Importantly, the same phenomenon emerges for NNs:

###### Theorem 4.3.

Feed-forward neural networks with sigmoid activation are an expressive encoding for uniform crossover, with complexity .

Proof sketch. Take the parents in Figure 3(a), who differ only in their second-layer weights. The input to the child’s bottleneck is proportional to a uniformly-sampled integer. Choosing threshold biases and high enough , , and results in a mutually-exclusive switch for each that fires with the correct probability. ∎

In the case of NNs, let single-point mutation be a Gaussian mutation, i.e., a single weight is selected uniformly at random and modified by the addition of Gaussian noise. The expressivity construction is similar to that of crossover, leading to the following:

###### Theorem 4.4.

Feed-forward neural networks with sigmoid activation are an expressive encoding for single-point mutation, with complexity .

Proof sketch. Take Parent 2 from Figure 3(a). Choose so that mutation is very likely to occur in the first two layers. Since applying there yields some continuous distribution over bottleneck output, suitable thresholds and switches can be created. ∎

The constructions so far give concrete ways to create parents to demonstrate expressivity. The next result generalizes these ideas to encodings built from any universal function approximator. The idea is that any universal function approximator Ω can be extended to an expressive encoding E_Ω, defined as follows:

###### Definition 0 (EΩ).

Let Ω be any universal function approximator. Define E_Ω to be an encoding whose genotypes are of the form , where , and is a function .

###### Theorem 4.5.

E_Ω is an expressive encoding for uniform crossover.

Proof sketch. Choose a large enough . Then, let and , for a suitable . ∎

So, if you have a favorite class of models that is expressive in terms of its function-approximation capacity, it can be turned into a potentially powerful evolutionary substrate.

The constructions so far have used indirect encodings, in that the phenotype space (e.g., binary vectors) is of a different form than the genotype space (e.g., GP, NN, or E_Ω). However, an encoding need not be indirect to be expressive, as is demonstrated by the following case of direct encoding of neural networks.

###### Theorem 4.6.

Direct encoding of feed-forward neural networks with sigmoid activation is an expressive encoding for uniform crossover.

Proof sketch. Take the parents in Figure 3(b). They differ only in the biases in the first layer of nodes. If this layer is large enough, has enough information from a child’s biases to decide which function the overall NN should compute. ∎

This section has shown that sufficiently complex evolutionary encodings, in particular NN and GP, are expressive. Another way of seeing the power of this property is to consider any stochastic process (e.g., an EA) that samples from a distribution μ after some number of steps: An expressive encoding can simulate this behavior in one step. This view gives us an analogy to a powerful result from the NN meta-learning literature: Given a distribution of tasks, there exists a neural network that can learn any task from this distribution with a single step of gradient descent (finn2017meta). This connection is meaningful: just as meta-learning should be able to encode any learning process, evolution should be able to encode any phenotype-sampling process.

The fact that the above encodings are expressive with single-point mutation, also known as random local search (doerr2008comparing), is remarkable. Thanks to expressive encoding, random local search in the genotype space leads to maximally global evolutionary sampling in the phenotype space. The next section will show that this property can be used to solve challenging problems where the standard direct encoding is likely to fail.

## 5. Adaptation and Convergence

Section 4 showed how arbitrarily complex behavior is possible with a single application of an SGO when encodings are expressive. An immediate question is: Can this power actually be exploited to solve challenging problems with evolution? This section considers some of the most frustrating kinds of problems for the standard GA: those with high-level, high-dimensional structure that can only be exploited if the population develops to a point where particular high-dimensional jumps emerge with reliable probability. Standard GAs struggle to achieve such an emergence, since, as the dimensionality of the required jump increases, its probability vanishes. This section considers three problems that cover the different ways in which such high-level, high-dimensional structure can appear: It can be temporal and deterministic, temporal and stochastic, or spatial. In all three cases, expressive encodings, implemented through GP, yield striking improvements over direct encodings.

### 5.1. Problem Setup

As is common for ease of analysis (doerr2020probabilistic; lengler2020drift; meyerson2019modular; witt2013tight), this section considers the (1+λ)-EA (pseudocode is provided in Appendix B). In this algorithm, during each generation λ candidates are independently generated by mutating the current champion. One of the best candidates replaces the champion if it meets the acceptance criterion, which is usually that it has higher fitness. When the fitness function is dynamic, the fitness of the champion is re-evaluated in every generation along with the fitnesses of the candidates.

To evaluate algorithms on dynamic fitness functions, the concept of adaptation complexity needs to be defined.

###### Definition 0 (Adaptation Complexity).

The adaptation complexity of an algorithm A on a dynamic fitness function f_t is the expected time A spends away from a global optimum vs. at a global optimum in the limit, i.e., the relative amount of time it spends adapting. Formally, if f*_t is the maximum value of f_t at time t, and f̂_t is the best fitness in the population at time t, the adaptation complexity is

 (3) E[ lim_{t→∞} |{t′ ≤ t : f̂_{t′} < f*_{t′}}| / |{t′ ≤ t : f̂_{t′} = f*_{t′}}| ].
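Adaptation complexity can be estimated empirically from a run's fitness trace. The helper below is hypothetical (not from the paper) and uses the ratio of generations spent strictly below the current optimum to generations spent at it:

```python
def empirical_adaptation_complexity(best_fitness, max_fitness):
    """Finite-horizon estimate: generations with best fitness strictly
    below the current global optimum, divided by generations at it."""
    assert len(best_fitness) == len(max_fitness)
    away = sum(b < m for b, m in zip(best_fitness, max_fitness))
    at = len(best_fitness) - away
    return away / at if at > 0 else float("inf")

# Toy trace: the optimum (fitness 5) is reached at t=2; each flip costs
# one generation of repair before the optimum is recovered.
best = [2, 4, 5, 5, 4, 5, 4, 5]
opt  = [5, 5, 5, 5, 5, 5, 5, 5]
print(empirical_adaptation_complexity(best, opt))  # 4 away / 4 at = 1.0
```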

The three problems are considered in order of ease of analysis. It may seem counter-intuitive to consider dynamic fitness functions before static fitness functions, but it turns out that the dynamism of the fitness function provides an exploratory power that expressive encodings are able to exploit naturally, but direct encodings cannot exploit at all (other advantages of dynamic fitness have been observed in prior work (crombach2008evolution; kashtan2007varying)). As before, sketches of the proofs are provided in this section; details are included in Appendix C.

### 5.2. Challenge Problems

###### Problem 1.

Deterministic Flipping Challenge (DFC): Take any two target phenotypes y, ȳ ∈ {0, 1}^n, where ȳ is the complement of y. At time t = 0, the current target vector is y. The fitness is the number of bits in the phenotype that match the target. If the fitness of the champion is n (i.e., maximal), then at the next time step the current target vector flips to the other target vector.

The difficulty in this problem is clear: As soon as the maximum fitness is achieved, the target changes. This kind of situation arises naturally in continual evolution, in which new problems arise over time, and the algorithm is initialized at its solution to the previous problem (braylan2016reuse; fernando2017pathnet; francon2020effective; luders2016continual; neumann2015runtime; wang2020enhanced). For example, imagine a neural architecture search system in which new machine learning problems arise over time. The algorithm cannot be restarted from scratch for each problem, because it is so expensive to run, so it picks up from where it left off. Partial attempts at such a system have been made previously (golovin2017google). Problem 1 can also be viewed as a dynamic version of the OneMax problem; other such versions have been considered in the past (droste2003analysis; jansen2005theoretical; sudholt2018robustness), but none allows for such an extreme change in the target vector. More generally, this test problem evaluates an algorithm’s ability to avoid ‘catastrophic forgetting’: it should not be too quick to forget useful past solutions, and ideally, it should be able to recover them in constant time.

For the expressive encoding, consider GP genotypes with the structure shown in Figure 4(a), whose terminals are evolvable bit vectors, including a control vector that switches between two stored solutions. The champion is initialized with random bits. It turns out that GP results in a dramatic speed-up: While the direct encoding must rediscover the new target bit by bit after each flip, the GP encoding, initially ignorant of the targets, spends only constant time adapting.

###### Theorem 5.1.

The (1+λ)-EA with direct encoding has adaptation complexity Θ(n log n) on the deterministic flipping challenge.

Proof sketch. This follows from a standard coupon collector argument (doerr2020probabilistic). ∎

###### Theorem 5.2.

The (1+λ)-EA with GP encoding has adaptation complexity O(1) on the deterministic flipping challenge.

Proof sketch. Let . Say and . Then, the algorithm only accepts mutations that move towards . Once and is mutated, will converge to . Then, if and is large enough, the chance of making a mistake in or is so small that the expected time spent on repairs is constant. ∎
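The constant-time adaptation mechanism can be sketched as follows. This is a simplified, hypothetical stand-in for the genome structure of Figure 4(a): a parity-controlled switch stores both targets, so one control-bit mutation toggles the phenotype between the complementary optima:

```python
import random

def decode(genome):
    """Return y1 if par(c) is odd, else y2: a parity switch between
    two stored candidate solutions."""
    c, y1, y2 = genome
    return y1 if sum(c) % 2 == 1 else y2

y = [1, 0, 1, 1, 0, 0, 1, 0]
ybar = [1 - b for b in y]

# A converged genome: both targets are stored, parity vector c = 0.
genome = ([0, 0, 0, 0, 0], y, ybar)
assert decode(genome) == ybar

# After the target flips, mutating any single bit of c recovers the
# complementary optimum in one step.
c = list(genome[0])
c[random.Random(0).randrange(len(c))] ^= 1
assert decode((c, y, ybar)) == y
```

Since any of the control bits works as the toggle, the expected waiting time for such a mutation does not grow with the phenotype dimension.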

Thus, the expressive encoding yields asymptotically optimal adaptation complexity. The reader may be concerned that this encoding takes longer than the direct encoding to initially reach a global optimum. There are many ways to address this issue, e.g., by including mutations that enable the genome to grow in size over time. However, the next problem shows an even sharper advantage: Direct encoding spends exponential time away from the optimum, while the expressive encoding is asymptotically unaffected.

###### Problem 2 ().

Random Flipping Challenge (RFC): Take any two target phenotypes t and its complement. At each time step the current target vector is selected to be t or its complement with probability 0.5 each. The fitness is the number of bits in the phenotype that match the current target.

This is the same as Problem 1, but with the target vector flipping randomly at each generation.
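A minimal sketch of this fitness function (function and variable names are illustrative):

```python
import random

def rfc_target(t, rng):
    """Sample the current RFC target: t or its complement, each w.p. 1/2."""
    comp = [1 - b for b in t]
    return t if rng.random() < 0.5 else comp

def match_fitness(x, target):
    """Number of positions where phenotype x matches the target."""
    return sum(int(a == b) for a, b in zip(x, target))
```

Note that a phenotype matching one target perfectly scores zero against the other, which is what makes the random flips so punishing for a direct encoding.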

###### Theorem 5.3.

The (1+λ)-EA with direct encoding and λ ∈ {1, 2} has adaptation complexity 2^Ω(n) on the random flipping challenge.

Proof sketch. Lower-bound the hitting time of either target, using the fact that when the champion is within a constant fraction of n bits of the target, the chance of moving towards it is less than half that of moving away. ∎

###### Theorem 5.4.

The (1+λ)-EA with GP encoding and λ = 2 has adaptation complexity O(1) on the random flipping challenge.

Proof sketch. With λ = 2, L can be chosen large enough that the chance of moving towards the correct target is always more than twice that of moving away, which can be used to upper-bound the hitting time. Then, as in Theorem 5.2, once b1 and b2 have initially converged, the expected time spent on repairs is constant. ∎

The theoretical results for Problems 1 and 2 are validated experimentally in Figure 4(c-d). Consistent with its adaptation complexity, the GP encoding spends an increasing overall percentage of time at optimal fitness. The power of expressive encodings also manifests on static fitness functions, like the one in Problem 3.

###### Problem 3 ().

Large Block Assembly Problem (LBAP): In this problem, there are two targets t1 and t2 hidden somewhere amongst the n bits at non-overlapping indices, each of length n/2. If the solution contains both targets the fitness is n; otherwise it is the maximum number of bits matched to either target.

That is, there are solutions of size n/2 in disjoint subspaces that must be discovered and combined. This problem is challenging for a direct encoding: Once one target is found, there is no fitness gradient until all remaining bits are matched, which takes exponential time to happen by chance. Since the fitness function is fixed, the metric of interest is simply the expected time to reach the optimum.
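A minimal sketch of the LBAP fitness, under the simplifying assumption (used w.l.o.g. in Appendix C) that the two hidden blocks occupy the first and second halves of the bit string:

```python
def lbap_fitness(x, t1, t2):
    """Large Block Assembly fitness: full credit only when both hidden
    targets are matched; otherwise the best single-target match.
    Assumes t1 occupies the first half and t2 the second half."""
    m = len(t1)
    m1 = sum(int(a == b) for a, b in zip(x[:m], t1))
    m2 = sum(int(a == b) for a, b in zip(x[m:], t2))
    if m1 == m and m2 == m:
        return 2 * m          # both blocks assembled: maximal fitness
    return max(m1, m2)        # otherwise, best partial match
```

The discontinuity is visible directly: a solution matching one block perfectly gains nothing from partial progress on the other, which is the plateau that traps the direct encoding.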

For this problem, there is no dynamism in the fitness function to provide exploratory power to the (1+λ)-EA, so some basic mechanisms must be added to prevent it from getting stuck. Instead of simply accepting the candidate if it has higher fitness, the champion is replaced if any of the following conditions are met:

• Fitness. The candidate has greater fitness.

• Diversity. The candidate phenotype is further than Hamming distance one from the champion phenotype.

• Sparsity. The candidate phenotype has equal fitness and fewer ones than the champion phenotype.

With these rules, the algorithm can still be subject to deception, so the champion is reinitialized after T steps if it has not yet converged. Call this adjusted algorithm the (1+λ)-EA*. For the direct encoding, however, these mechanisms are not enough to overcome the problem's pitfalls.
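The replacement rule of the (1+λ)-EA* can be sketched as follows (a simplified reading of the three conditions above, with illustrative names):

```python
def hamming(x, y):
    """Hamming distance between two equal-length bit vectors."""
    return sum(int(a != b) for a, b in zip(x, y))

def accept(cand_ph, cand_fit, champ_ph, champ_fit):
    """(1+lambda)-EA* replacement sketch: the champion is replaced
    if any of the fitness, diversity, or sparsity conditions holds."""
    if cand_fit > champ_fit:                                    # fitness
        return True
    if hamming(cand_ph, champ_ph) > 1:                          # diversity
        return True
    if cand_fit == champ_fit and sum(cand_ph) < sum(champ_ph):  # sparsity
        return True
    return False
```

The diversity condition is what lets a single genotypic bit flip that moves the phenotype a large Hamming distance survive, even before its fitness benefit is realized.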

###### Theorem 5.5.

The (1+λ)-EA* with direct encoding and any restart time T converges in 2^Ω(n) steps on the large block assembly problem.

Proof sketch. Once one target is found, the algorithm is stuck, so both targets can only be found together by chance. ∎

However, using genotypes with the GP structure shown in Figure 4(b), with L = 1, leads to tractability.

###### Theorem 5.6.

The (1+λ)-EA* with GP encoding and L = 1 converges in a polynomial expected number of steps on the large block assembly problem.

Proof sketch. Consider convergence paths where the final large jump occurs only at the last step. Compute the probability of such a path, and use the restarts to get the expected run time. ∎

These theoretical conclusions are demonstrated experimentally in Figure 4(e). In 100 independent trials, with the direct encoding, the algorithm never reaches beyond half the optimal fitness; with GP encoding it always makes the final large jump before the 2M evaluation limit.

### 5.3. Extensions

The algorithms in this section worked with fixed GP structures, evolving the values within them. This approach is sufficient to demonstrate the power and potential of expressive encodings; future work will analyze methods that evolve the structure of representations as well. However, note that the above structures are each of a constant size: One could simply enumerate all such structures, trying the algorithm on each, and only pay a constant multiplicative cost, not affecting the asymptotic complexities. Although this is a brute-force approach, it suggests that powerful structures may actually not be so difficult to find, and indeed meaningful structures are commonly found by existing GP methods (schmidt2009distilling; shahrzad2015tackling).

The problems in this section all required jumps of size Ω(n). Prior work with direct encodings has sought to tackle larger and larger jumps, but they are still generally sublinear (bambury2021generalized; dang2016escaping). The closest comparison with SGOs (watson2007building) had jumps of linear size, but required a representation that was a priori well-aligned for two-point crossover, and an island model with a number of islands dependent on the problem size. In contrast, with an expressive encoding, successful jumps of size Ω(n) can be achieved with only single-point mutation in a (1+λ)-EA with λ ∈ {1, 2}, and simple sparsity and diversity methods.

The diversity mechanism used for Problem 3 could also be used to generalize the first two problems to any two target vectors. This simple diversity mechanism is an instance of a behavior domination approach (meyerson2017discovering), in that solutions at least a fixed distance from the current solution are non-dominated.

## 6. Discussion and Future Work

The definitions and analysis of expressive encodings in this paper can be seen as a starting point for future work in several areas:

Biological Interpretation. Because they can capture complex reproductive distributions, expressive encodings could in general be more accurate than direct encodings in representing complex models of biological evolution. If genetic regulatory networks are universal function approximators (models of them often are, e.g., recurrent NNs, ODEs, and boolean networks (crombach2008evolution; karlebach2008modelling; yaghoobi2012review)), then Theorem 4.5 suggests how natural evolution can become arbitrarily powerful over time. In the expressive encodings with crossover constructions in Section 4, the high level of shared structure across parents is also consistent with nature. Humans share over 99% of their DNA, and high proportions of DNA are even shared across species (collins1997variations; hardison2003comparative; simakov2022deeply). This construction may partly explain why crossover is so prominent in biology: It works best when a large part of the genome is shared. Another surprising result from Section 5 is that for expressive encodings in dynamic environments, generating multiple offspring per generation is critical to achieving stable performance. This result makes biological sense as well: Even when the probability of a good offspring is high, if there is any chance that a bad one can set you back considerably due to latent nonlinearities in the genome, it is prudent to have multiple offspring.

Theory. The theory in this paper did not use any deep EA theory methods, such as drift methods (doerr2012multiplicative; doerr2013adaptive; lengler2020drift). Such methods could enable rapid extension of these results to other encodings, operators, and test domains. To highlight the mechanisms that make expressive encodings powerful, the test domains were idealized to contain two complementary targets. Future work can generalize this approach to more targets, and to compositions of targets that have not been seen before, akin to generalization in machine learning. Section 5 demonstrated a type of stability, i.e., preventing catastrophic forgetting in particular cases. How far can such results be generalized? For instance, can catastrophic forgetting be asymptotically avoided in individuals as complex as the ones constructed in Section 4? What other kinds of stability are possible?

Practice. In building practical applications, it is not clear whether the first step should use incrementally growing encodings like those in GP and NEAT (stanley2002evolving), or giant fixed structures with many evolvable parameters. The EA community has tended to prefer incremental encodings, but deep learning has seen remarkable success via huge fixed structures with canonical form and learnable parameters. Perhaps expressive encodings with a fixed structure would be an easier place to start, since understanding the behavior of the system may be easier without the variable of dynamic genotypic structure. Also, one of the common motivations for indirect encodings is that the genotype can be smaller than the phenotype. However, in the constructions of Section 4 the genotype is generally much larger than the phenotype, which suggests that directly evolving huge genotypes could be promising (similar to deep GA (such2017deep)).

Open-ended Evolution. The power of expressive encodings comes from the ability of the transmission function to improve throughout evolution. This phenomenon has been previously explored in GP (altenberg1994evolution; hu2011robustness) and neuroevolution (huizinga2018emergence; lehman2013evolvability), and insights there should be useful in analyzing the behavior of expressive encodings more generally. E.g., motivated by Evolvability-ES, a construction related to miracle jumps for directly-encoded NNs has been developed for i.i.d. perturbations (gajewski2019evolvability, Thm. S6.1). As a longer-term opportunity, the promise of expressive encodings to continually complexify and innovate upon the transmission functions indicates that they should be a good substrate for open-ended evolution (hintze2019open; soros2014identifying; stanley2019open; stepney2021modelling; taylor2016open; wang2020enhanced), and, likewise, insights from open-endedness could be useful in developing more practical SGO + Expressive Encoding algorithms. Thus, the long-term ideal is an algorithm that never needs to be restarted: press ‘go’ once and over time it becomes better and better at solving problems as they appear, accumulating problem-solving ability in the genotypes themselves. Nature has provided an existence proof of such an algorithm. For organisms to become boundlessly and efficiently more complex over time, the mapping from parents to offspring must become more complex over time. Expressive encodings are an important piece of this puzzle.

## 7. Conclusion

This paper identifies expressivity as a fundamental property that makes encodings powerful. It allows local changes in the genotype to have global effects in the phenotype, which is useful in making large jumps in particular, and approximating arbitrary probability distributions in general. Direct encodings are not generally expressive, whereas genetic programming and neural network encodings are. As demonstrated in three problems illustrating different challenges, expressive encodings may bring striking improvements in adaptation and make intractable problems tractable. We believe expressivity provides a productive perspective for further research on many aspects of evolutionary computation.

## Appendix A Proofs for Section 4

###### Theorem 4.1 ().

Genetic programming is an expressive encoding for uniform crossover, with complexity O(m(n + log(m/ε))).

###### Proof.

Let x_a, x_b be the phenotypes of the two parent arguments to the uniform crossover operator. Choose the parent genotypes g_a, g_b to be the two programs shown in Figure 2(a). These two programs differ only in their value of v (v = 0^k in g_a and v = 1^k in g_b), so the child will be equivalent to its parents, except its value of v will be uniformly sampled from {0,1}^k.

When interpreted as a binary integer, this value of v is an integer uniformly sampled from {0, …, 2^k − 1}, where k is the length of the vector v. Let each x_i be a binary vector of length n, each of which can also be interpreted as an integer, with associated probability p_i. Now, the approach is to choose k large enough and then choose thresholds l_i so that each x_i is generated when the integer falls in a given range, which occurs with probability within ε of p_i.

Let K = 2^k. Then, there are K integers each sampled with probability 1/K. Let l_0 = 0, and l_i = l_{i−1} + ⌊p_i K⌋. Then,

 (4) p_i − ε/2 ≤ (l_i − l_{i−1})/K ≤ p_i + ε/2.

In order to ensure that the parent genotypes generate their required phenotypes, the single integers 0 and K − 1 are assigned to x_a and x_b. The probabilities of sampling these are each less than ε/2, and are subtracted from the probabilities for x_1 and potentially x_m, so that for every j,

 (5) p_j − ε ≤ Pr(x_j) ≤ p_j + ε.

In the parent genotypes, each conditional contains k + n bits (to encode l_i and x_i) and there are m conditionals. Since k = O(log(m/ε)), the required total genotype size is O(m(n + log(m/ε))). ∎
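The apportioning step of this proof can be illustrated concretely: given target probabilities, assign each outcome a block of k-bit integers so that a uniform k-bit integer selects each outcome with nearly the right probability. The largest-remainder rounding used here is one possible choice, not necessarily the paper's:

```python
def apportion(probs, k):
    """Assign each outcome a count of k-bit integers so that
    count/2^k is within 1/2^k of the target probability."""
    K = 2 ** k
    raw = [p * K for p in probs]
    counts = [int(r) for r in raw]
    # Distribute leftover integers to the largest fractional parts.
    leftover = K - sum(counts)
    order = sorted(range(len(probs)), key=lambda i: raw[i] - counts[i],
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

counts = apportion([0.5, 0.3, 0.2], k=10)
assert sum(counts) == 1024
assert all(abs(c / 1024 - p) <= 1 / 1024
           for c, p in zip(counts, [0.5, 0.3, 0.2]))
```

Increasing k shrinks the approximation error, mirroring the role of ε in the theorem.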

###### Theorem 4.2 ().

Genetic programming is an expressive encoding for single-point mutation, with complexity O(mn²/ε).

###### Proof.

Let x' be the phenotype of the single parent argument to the single-point mutation operator. Choose the parent genotype to be the program shown in Figure 2(b), with each L_i a binary vector of zeros. Since par(L_i) is false for every i, the genotype's phenotype is x'. Our approach is to choose the length of each L_i so that the probability of selecting a bit to flip in L_i is approximately p_i.

When single-point mutation chooses a bit to flip, it selects the bit uniformly from the x_i's and the L_i's. The total size of the x_i's is mn. So, let

 (6) ∑_{i=1}^m |L_i| = ⌈mn²/ε⌉.

Then, the L_i's can be apportioned proportionally to the p_i's such that the chance of flipping a bit in L_i out of all the L_i's is within ε/2 of p_i. The chance of instead flipping a bit in the x_i's is less than ε/2, so the overall probability of flipping a bit in L_i is within ε of p_i.

In this construction, the size of the L_i's dominates that of the x_i's, giving a complexity of O(mn²/ε). ∎

Note that the parent in Figure 2(b) could have been more like Parent 1 in Figure 2(a), using a single long bit vector and ‘<’ comparisons. Instead, the parity function ‘par’ is used, because it demonstrates an alternative kind of construction, and because it is used again in Section 5.

###### Theorem 4.3 ().

Feed-forward neural networks with sigmoid activation are an expressive encoding for uniform crossover, with complexity O(mn + log(m/ε)).

###### Proof.

This construction is similar to the uniform crossover construction for genetic programming. Let the parents g_a, g_b be defined by the four-layer neural networks shown in Figure 3(a), where each internal node is followed by a sigmoid activation, and, except where noted below, all biases are 0.

The two parents differ only in the values of their second layer of weights. Since all the input weights to the first layer are zero, the output of each node in the first layer is 0.5. Then, since each power of two is included with probability 1/2, uniform crossover makes the input to the bottleneck equal 0.5 times an integer sampled uniformly from {0, …, 2^k − 1}. Each such value uniquely defines the output of the bottleneck neuron.

Similarly to the GP case, a set of thresholds is then created. First, a large incoming weight is set to amplify the differences between successive outputs from the bottleneck, and the biases of the threshold units are then set so that they output nearly one if the threshold is met and nearly zero otherwise. For brevity, we say that the unit “fires” if its output is nearly one.

There is a switch neuron for each desired phenotype x_i. By setting the incoming weights as large as we want, this switch will fire only if its lower threshold unit fires, but its upper one does not. Since the thresholds are monotonically increasing, exactly one switch unit will fire.

Finally, the weights from the switch for x_i to the output of the whole network are ±w, with w arbitrarily large. So, if the switch for x_i fires, the input to the jth output node will be negative if the jth element of x_i is 0, and positive if the jth element of x_i is 1. Therefore, after squashing through the final sigmoid, the output of the entire network will be nearly x_i, which after rounding yields exactly x_i (see Equation 1). That is, the NN outputs the binary vector x_i.

Note that similar considerations as in the case of GP can be taken into account to ensure that each parent genotype generates its own required phenotype.

As in the case of GP, it is necessary that 2^k = Ω(m/ε). There are O(k) weights in the first two layers, O(m) in the next two layers, and O(mn) in the final layer. So, in terms of the number of weights, the complexity of this construction is O(mn + log(m/ε)). The apparent improvement over the GP construction comes from the fact that real-numbered weights were used instead of bits. ∎
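The threshold-and-switch mechanism can be sketched numerically: with a large gain, sigmoids act as soft step functions, and exactly one switch output is near one for the interval containing the bottleneck value. The gain and thresholds below are illustrative assumptions, not values from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def switch_layer(z, thresholds, gain=100.0):
    """Return one value per interval; the entry for the interval
    containing z is near one, all others near zero."""
    lower = [sigmoid(gain * (z - t)) for t in thresholds[:-1]]  # lower fires
    upper = [sigmoid(gain * (z - t)) for t in thresholds[1:]]   # upper fires
    # A switch fires iff its lower threshold unit fires but its upper
    # one does not.
    return [sigmoid(gain * (lo - up - 0.5)) for lo, up in zip(lower, upper)]

s = switch_layer(2.5, [0.0, 1.0, 2.0, 3.0])
assert [round(v) for v in s] == [0, 0, 1]
```

Because the thresholds are monotonically increasing, the lower-minus-upper trick guarantees exactly one near-one output, matching the "exactly one switch unit will fire" step of the proof.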

###### Theorem 4.4 ().

Feed-forward neural networks with sigmoid activation are an expressive encoding for single-point mutation, with complexity O(mn/ε).

###### Proof.

Take the single parent for mutation to be Parent 2 in Figure 3(a). In fact, in this case, the exact weights in the first two layers do not matter, as long as the weights in the second layer are non-zero. A weight in the first two layers is selected for mutation with probability at least 1 − ε, by choosing a large enough number of first-layer units. Now, no matter what the weights are in the first two layers, the stochastic process of (1) selecting one of these weights, and (2) adding Gaussian noise to this selected weight yields some continuous distribution over the output of the bottleneck unit. Similar to the case of crossover, the biases of the threshold units can be set to partition the output distribution of the bottleneck unit appropriately, again setting the amplifying weights high enough to make the output of each unit nearly one or nearly zero. It is also possible to include threshold units for the case of no mutation to ensure that the parent's phenotype is sampled with the appropriate probability. Similar to the construction for GP with mutation, the complexity of this construction is dominated by the first two layers, i.e., the overall complexity is O(mn/ε). ∎

###### Theorem 4.5 ().

Any universal function approximator is an expressive encoding for uniform crossover.

###### Proof.

As in the previous constructions for uniform crossover, choose k with 2^k = Ω(m/ε). Let the two parents share a function h from the universal class and differ only in a length-k bit vector v, with v = 0^k in one parent and v = 1^k in the other. Again, since the parents differ only in their values of v, the child will be of the form h(v), where v is drawn uniformly from {0,1}^k. It is clear from the previous constructions that values in {0,1}^k can be apportioned to assign accurate-enough probability to each x_i, and h can thus be chosen so that it assigns probability in this way. That is, {0,1}^k is partitioned into subsets V_1, …, V_m so that |V_i|/2^k is within ε of p_i, and h is chosen so that if v ∈ V_i then h(v) = x_i, while also ensuring that h(0^k) = x_a and h(1^k) = x_b. ∎

###### Theorem 4.6 ().

Direct encoding of feed-forward neural networks with sigmoid activation is an expressive encoding for uniform crossover.

###### Proof.

The phenotype is directly encoded as a neural network. That is, the phenotype space consists of functions f_1, …, f_m, where each can be represented as a neural network of a common architecture. Given a desired set of such functions and associated probabilities p_1, …, p_m with ∑_i p_i = 1, the goal is to find directly-encoded parents g_a and g_b whose crossover results in the desired distribution.

A similar approach can be taken as with previous constructions, but with direct encodings of neural networks, the source of randomness can be placed in the biases of nodes whose input weights are all zero, so that they effectively serve as auxiliary inputs to the model.

Let g_a and g_b be the two parents shown in Figure 3(b). Let the first hidden layer of both g_a and g_b have n + k units u_1, …, u_{n+k}. For i ≤ n, let the incoming weights to u_i all be zero except the weight from the ith input, which is set to 1, and let the bias of u_i be 0. For i > n, let the incoming weights to u_i all be zero, and let the bias of u_i be 0 in g_a and 1 in g_b. Let the remainder of g_a and g_b, denoted h, be shared.

Then, a child generated by uniform crossover will be equivalent to its parents, except that the biases of u_{n+1}, …, u_{n+k} will be sampled uniformly from {0,1}^k. Choosing 2^k = Ω(m/ε), since feed-forward NNs are universal function approximators, h can again be selected so that each f_i is sampled with approximately the desired probability—that is, the sampled biases tell h which function to compute. Importantly, the sigmoid activation after the first layer of g_a and g_b does not lose any information about the input, since it is continuous and strictly monotonic, so the input to h still uniquely identifies the input to the whole network. ∎

## Appendix B The (1+λ)-Evolutionary Algorithm

Algorithm 1 provides pseudocode for the algorithm used in Section 5, where X is the encoding and f is the fitness function.
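Since Algorithm 1 itself is not reproduced here, the following is a minimal sketch of a (1+λ)-EA consistent with its description; the single-point mutation operator and the tie-accepting selection rule are assumptions for illustration.

```python
import random

def one_plus_lambda_ea(encode, fitness, genome_len, lam=1, steps=2000, seed=0):
    """Minimal (1+lambda)-EA sketch: keep one champion; each generation,
    create lam children by single-point mutation and keep the best
    individual (ties accepted)."""
    rng = random.Random(seed)
    champ = [rng.randint(0, 1) for _ in range(genome_len)]
    champ_fit = fitness(encode(champ))
    for _ in range(steps):
        children = []
        for _ in range(lam):
            child = list(champ)
            i = rng.randrange(genome_len)
            child[i] ^= 1                      # single-point mutation
            children.append((fitness(encode(child)), child))
        best_f, best_c = max(children, key=lambda p: p[0])
        if best_f >= champ_fit:
            champ, champ_fit = best_c, best_f
    return champ, champ_fit

# Example: direct encoding (identity) on OneMax.
best, fit = one_plus_lambda_ea(lambda g: g, sum, genome_len=8, steps=500)
assert fit == 8
```

Swapping the identity `encode` for a GP-style decoder is what turns this into the expressive-encoding variant studied in Section 5.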

## Appendix C Proofs for Section 5

The proofs in this section consider the (1+λ)-EA (Algorithm 1), with λ ∈ {1, 2}.

###### Theorem 5.1 ().

The (1+λ)-EA with direct encoding has adaptation complexity Θ(n log n) on the deterministic flipping challenge.

###### Proof.

This follows from a standard coupon collector argument (doerr2020probabilistic), where it takes Θ(n log n) steps to move to a point n bits away. Increasing λ to 2 does not improve the asymptotic complexity. ∎

###### Theorem 5.2 ().

The (1+λ)-EA with GP encoding has adaptation complexity O(1) on the deterministic flipping challenge.

###### Proof.

W.l.o.g. assume the first target is the vector of ones and the second is the vector of zeros. The crux of the proof is showing that, once the algorithm converges initially, if λ is at least 2, it is possible to make L large enough that the expected time spent recovering from forgetting is constant.

##### Initial Convergence

Let the genome (b1, b2, c) be initialized uniformly at random. Say the initial target is the ones vector 1. Let |·| denote the 1-norm, i.e., the number of 1's in a vector.

Case 1: Suppose |b1| > |b2| and par(c) = 1, so that b1 is returned. Then the only accepted children are ones that flip a 0 to a 1 in b1. In this case, b1 will converge to 1 in O((n+L) log n) steps, by a classic coupon collector argument.

Now the target flips to the zeros vector 0. Since L dominates the genome length, in a constant expected number of steps a bit of c is flipped, thereby returning b2. (If, in the meantime, a bit of b1 is accidentally flipped, the expected cost to repair this mistake can also be constant, as shown in the next section.) Since |b1| > |b2|, now the only accepted children will be ones that flip a 1 to a 0 in b2.

Case 2: Suppose |b2| ≥ |b1| and par(c) = 0. By symmetry, the same result is obtained as in Case 1.

There is a probability of 1/2 that a random initialization results in Case 1 or Case 2. If not, Case 1 or 2 can be established by either flipping a bit of c, or flipping bits of whichever of b1 or b2 is being returned until it has more 1's than its counterpart.

The expected time to flip a bit of c is O((2n+L)/L), which is O(1) when L = Ω(n), so this initial time to get to Case 1 or 2 is negligible. The overall expected time to convergence is then O((n+L) log n).

##### Repeated Convergence

After initial convergence, w.l.o.g. suppose b1 = 1, b2 = 0, and par(c) = 1, so b1 is returned; when the target flips to 0, the fitness of the champion drops to 0.

A new candidate can be generated in three ways:

Case 1: flip a bit of c, so b2 = 0 is returned and the process is done.

Case 2: flip a bit of b1, so the fitness increases and the child is accepted.

Case 3: flip a bit of b2, so the phenotype is unchanged.

Only Case 2 is of concern, since in this case an important bit of b1 can be forgotten when the new candidate is accepted. In this case, up to n mistakes can be made in b1 before a bit is finally flipped in c. The chance of making many mistakes in a row vanishes quickly, but even in the worst case, it would only take O((n+L) log n) steps to make the repairs. So, the expected time needed to make repairs is less than the probability of making at least one mistake times the cost of making repairs, i.e., E(repairs) ≤ Pr(mistake) · O((n+L) log n).

Suppose λ = 1. Then,

 (7) Pr(mistake) = n/(2n+L), and
 (8) E(repairs) = O(n/(n+L)) · O((n+L) log n) = O(n log n).

That is, O(n log n) steps are spent on repairs every time the target flips, which is no better than the direct encoding case.

Suppose instead λ = 2. Then,

 (9) Pr(mistake) = (n/(2n+L))², and
 (10) E(repairs) = O((n/(n+L))²) · O((n+L) log n) = O(n² log n/(n+L)).

So, choosing L = n² log n leads to

 (11) E(repairs) = O(n² log n/(n+L)) = O(1).

Thus, both the expected time to hit Case 1 and the expected time to make any necessary repairs are constant. ∎

###### Theorem 5.3 ().

The (1+λ)-EA with direct encoding and λ ∈ {1, 2} has adaptation complexity 2^Ω(n) on the random flipping challenge.

###### Proof.

W.l.o.g., suppose the two targets are the vectors of all zeros and all ones. Suppose the champion contains i ones at time t, and generates one candidate at time t. If the target at time t is all-ones, and a zero in the champion is flipped to a one, then the change is accepted. This flip happens with probability (n−i)/(2n). If the target at time t is all-zeros, and a one in the champion is flipped to a zero, then the change is accepted. This flip happens with probability i/(2n). The champion is kept with the remaining probability, which is greater than 1/2.

Now, if i ≤ n/3, then the chance of moving towards the all-zeros target is less than half that of moving away. Note that with λ = 2, the corresponding threshold is even higher. So, suppose i ≤ n/3, and let h0_i be the expected hitting time of reaching all zeros starting with i ones. Clearly, h0_0 = 0. Suppose h0_{i+1} ≥ c_{i+1} + h0_i for some c_{i+1} ≥ 0. Now,

 (12) h0_i ≥ 1 + (1/6)h0_{i−1} + (1/2)h0_i + (1/3)h0_{i+1}
 (13) ⟹ h0_i ≥ 1 + (1/6)h0_{i−1} + (1/2)h0_i + (1/3)(c_{i+1} + h0_i)
 (14) ⟹ h0_i ≥ 6 + 4c_{i+1} + h0_{i−1} = c_i + h0_{i−1}.

Now, c_i = 6 + 4c_{i+1}, so c_1 = 2^Ω(n). That is, the pull towards the center is so strong that even when the champion is one bit away from a target, the expected time to reach a target is 2^Ω(n). When at a target, the chance of moving away from it is constant, so this time to again reach a target dominates the adaptation complexity. ∎

###### Theorem 5.4 ().

The (1+λ)-EA with GP encoding and λ = 2 has adaptation complexity O(1) on the random flipping challenge.

###### Proof.

W.l.o.g. assume the first target is the vector of ones, the second is the vector of zeros, and suppose that par(c) = 1, so that b1 is returned. The question is how long it takes for b1 to become all ones. The champion is improved if the target is all-ones, a zero is flipped in b1, and the change is accepted; call this a fix. The champion is hurt if the target is all-zeros, a one is flipped in b1, and the change is accepted; call this a break. Let |b| denote the number of ones in b1. A break occurs if both candidates make a mistake, or one does and the other is neutral (i.e., it flips a bit in b2). So,

 (15) Pr(break) = (1/2) · (|b|² + 2|b|n)/(2n+L)².

The probability of a fix is the chance that at least one of the two candidates makes an improvement:

 (16) Pr(fix) = (1/2) · (1 − ((2n+L−(n−|b|))/(2n+L))²).

Of interest is the ratio Pr(break)/Pr(fix). Choose L = n² log n. Then, with some algebra, it can be seen that Pr(break) < Pr(fix)/2 for all |b| < n.

Now, let h_i be the expected hitting time of b1 to reach all-ones starting from i ones. For brevity, write Pr(fix) given i ones as s_i. Then,

 (17) h_0 = 1 + s_0 h_1 + (1−s_0)h_0 ⟹ h_0 = 1/s_0 + h_1 = c_0 + h_1.

Suppose h_{i−1} ≤ c_{i−1} + h_i. Then,

 (18) h_i ≤ 1 + (s_i/2)h_{i−1} + s_i h_{i+1} + (1 − 3s_i/2)h_i
 (19) ⟹ h_i ≤ 1/s_i + (1/2)c_{i−1} + h_{i+1} = c_i + h_{i+1}. Then,
 (20) h_0 = (2n+L)/n + ∑_{i=1}^{n−1} c_i ≤ (2n+L)/n + ∑_{i=1}^{n−1} [1/s_i + (1/2)c_{i−1}]
 (21) ≤ (2n+L)/n + ∑_{i=1}^{n−1} [(2n+L)/(n−i) + (1/2^i)·(2n+L)/n]
 (22) = O(L/n) + O((n+L) log n) = O((n+L) log n).

So, the complexity for b1 to converge from any starting point is O((n+L) log n), the same as in Theorem 5.2, and, as in Theorem 5.2, with λ = 2,

 (23) Pr(mistake) = O((n/(n+L))²), and
 (24) E(repairs) = O((n/(n+L))²) · O((n+L) log n) = O(n² log n/(n+L)).

So, again choosing L = n² log n leads to

 (25) E(repairs) = O(n² log n/(n+L)) = O(1),

and once b1 and b2 have both converged, there is a constant chance of sampling the correct target at each step. ∎

###### Theorem 5.5 ().

The (1+λ)-EA* with direct encoding and any restart time T converges in 2^Ω(n) steps on the large block assembly problem.

###### Proof.

Since only a single bit is flipped, the diversity method has no effect on this problem. The sparsity method is not useful either: It only biases the search towards 0's, which is not helpful in general; in the worst case, both hidden targets are all 1's. In this case, the algorithm cannot make any progress once one of the targets is found, unless the state is already only one bit away from reaching the second target. So, the best strategy is to restart every iteration, i.e., set T = 1, leading to a convergence time of 2^Ω(n). ∎

###### Theorem 5.6 ().

The (1+λ)-EA* with GP encoding and L = 1 converges in a polynomial expected number of steps on the large block assembly problem.

###### Proof.

Call the two hidden targets t1 and t2. W.l.o.g., suppose t1 occurs in the first half of the bits of a solution, and t2 occurs in the last half. Let c1 and c2 be the two parity vectors in the structure of Figure 4(b), where L = 1. Since L = 1, refer to the scalar contents of c1 and c2 as s1 and s2, respectively.

Upon random initialization, there is a constant probability that s1 = 0 and s2 = 0, so that neither block is expressed. We are interested in how long it takes to reach the state b1 = t1, b2 = t2, s1 = 1, and s2 = 1, since this is a state of maximal fitness.

However, if a state with s1 = s2 = 1 is reached before b1 = t1 and b2 = t2, then the states in b1 and b2 can be pulled in the wrong direction, due to the interaction between b1 and b2 caused by ‘and’.

So, suppose that the algorithm reaches a state with s1 + s2 ≤ 1, i.e., at most one block expressed, before ever reaching a state with s1 = s2 = 1. Consider, then, the remaining three cases, which the algorithm alternates between as it converges:

Case 1: s1 = 1 and s2 = 0. In this case, a new solution is kept if (1) a bit in b1 is flipped that makes b1 closer to t1 (due to the fitness condition, since the fitness will increase), if (2) a bit in b1 is flipped from a 1 to a 0 (due to the sparsity condition, since the sparsity will increase), or if (3) s1 is flipped, which results in Case 2 (due to the diversity condition, since disabling b1 flips more than one bit in the phenotype).

Case 2: s1 = 0 and s2 = 0. In this case, a new solution is kept if s1 or s2 is flipped, resulting in Case 1 and Case 3, respectively (diversity condition). No other flips affect the phenotype.

Case 3: s1 = 0 and s2 = 1. In this case, a new solution is kept if (1) a bit in b2 is flipped that makes b2 closer to t2 (fitness condition), if (2) a bit in b2 is flipped from a 1 to a 0 (sparsity condition), or if (3)