GPFramework
A simple and flexible mutation-only Genetic Programming framework.
Genetic Programming (GP) is an evolutionary computation technique to solve problems in an automated, domain-independent way. Rather than identifying the optimum of a function as in more traditional evolutionary optimization, the aim of GP is to evolve computer programs with a given functionality. A population of programs is evolved using variation operators inspired by Darwinian evolution (crossover and mutation) and natural selection principles to guide the search process towards better programs. While many GP applications have produced human competitive results, the theoretical understanding of what problem characteristics and algorithm properties allow GP to be effective is comparatively limited. Compared to traditional evolutionary algorithms for function optimization, GP applications are further complicated by two additional factors: the variable length representation of candidate programs, and the difficulty of evaluating their quality efficiently. Such difficulties considerably impact the runtime analysis of GP where space complexity also comes into play. As a result initial complexity analyses of GP focused on restricted settings such as evolving trees with given structures or estimating the quality of solutions using only a small polynomial number of input/output examples. However, the first runtime analyses concerning GP applications for evolving proper functions with defined input/output behavior have recently appeared. In this chapter, we present an overview of the state-of-the-art.
Genetic Programming (GP) is a class of evolutionary computation techniques to evolve computer programs originally introduced by Koza [16]. GP uses genetic algorithm mutation, crossover and selection operators adapted to work on populations of program structures. Program fitness is evaluated using a training set consisting of samples of program inputs and the corresponding correct outputs. The goal of a GP system is to construct a program which, as well as producing the correct outputs on the inputs included in the training set, generalizes well to the other possible inputs.
In standard tree-based GP, as introduced by Koza, programs are expressed as syntax trees rather than lines of code, with variables and constants (collectively referred to as terminals) appearing as leaves in the tree, and functions (such as +, *, and cos) appearing as internal nodes. New programs are produced by mutation (applying changes to a copy of a parent solution) or crossover (replacing a subtree in one parent solution with a subtree from another parent). Several other variants of GP exist that use representations other than tree structures. Popular ones are Linear GP [1], Cartesian GP [27], and Geometric Semantic GP (GSGP) [30]. Since most of the available complexity analysis results focus on tree-based GP, this is where we keep our focus in this chapter. Work on GSGP is an exception that we will also consider [32].
One of the main points regarding GP made by Koza is that a wide variety of different problems from many different fields can be recast as requiring the discovery of a computer program that produces some desired output when presented with particular inputs [16]. Ideally, this process of discovery could take place without requiring a human to explicitly make decisions about the size, shape, or structural complexity of the solutions in advance. As GP systems provide a way to search the space of computer programs for one which is well-adapted to solving (or approximating) the problem at hand, they are thus applicable to a wide variety of problems, including those in artificial intelligence, machine learning, adaptive systems, and automated learning. GP has produced human-competitive results or patentable solutions on a large number of diverse problems, including the design of quantum computing circuits [48], antennas [22], mechanical [20], and optical lens systems [18]. From these results, Koza observes that Genetic Programming may be especially productive in areas where little information about the size or shape of the ultimate solution is known, while large amounts of data and good simulators are available to measure performance of candidate solutions.
While there are many examples of successful applications of GP (see [17] for an overview), the understanding of how such systems work and on which problems they are successful is much more limited. Compared to traditional evolutionary algorithms for function optimization, GP applications are further complicated by two additional factors: the variable length representation of candidate programs, and the difficulty of evaluating their quality efficiently since it is prohibitive or even impossible to test programs on all possible inputs. Such difficulties, naturally, considerably impact the runtime analysis of GP where space complexity also comes into play. As a result, while nowadays the analysis of standard elitist [3] and non-elitist genetic algorithms [36, 37, 2] has finally become a reality, analyzing standard GP systems is far more prohibitive. Indeed, McDermott and O’Reilly [26] remark that “due to stochasticity, it is arguably impossible in most cases to make formal guarantees about the number of fitness evaluations needed for a GP algorithm to find an optimal solution.” Similarly to how the analysis of simplified evolutionary algorithms (EAs) has gradually led to the achievement of the techniques that nowadays allow the analysis of standard EAs, Poli et al. suggested “computational complexity techniques being used to model simpler GP systems, perhaps GP systems based on mutation and stochastic hill-climbing” [45].
Following this guideline the first runtime analyses laying the groundwork for better understanding of GP considered simplified algorithms primarily based on mutation and hill-climbing (i.e., the algorithm introduced in [7]). However, further simplifications compared to GP applications in practice were necessary to deal with the additional difficulties introduced by the variable-length of GP solutions, the stochastic fitness function evaluations when using dynamic training sets, and the neighborhood structure imposed by the GP mutation and crossover operations acting on syntax trees. Indeed Goldberg and O’Reilly observed that “the methodology of using deliberately designed problems, isolating specific properties, and pursuing, in detail, their relationships in simple GP is more than sound; it is the only practical means of systematically extending GP understanding and design” [11]. To this end, the first runtime analyses of GP considered the time required to evolve particular tree structures rather than proper computer programs. In particular, solution fitness was evaluated based on the tree structure rather than by executing the evolved syntax tree. Problems belonging to this category are ORDER, MAJORITY [7] and SORTING [52]. Already in such simplified settings the characteristic GP problem, bloat (i.e., the continuous growth of evolved solutions that is not accompanied by significant improvements in solution quality), may appear.
In GP applications the set of all possible inputs is generally either too large to evaluate the exact solution quality efficiently, or not much of it is known (i.e., only a limited amount of information about the correct input/output behavior is available). As a result the performance of the GP system is usually considered in the probably approximately correct (PAC) learning framework [50], to show that the solution produced by the GP system generalizes well to all inputs. Kötzing et al. isolate this issue when they present the first runtime analysis of a GP system in this framework [14]. They consider the problem of learning the weights assigned to bits of a pseudo-Boolean function (i.e., the identification problem), proving that a simple GP system can discover the weights efficiently even by using a limited sample of the possible inputs to evaluate solution quality.
A more realistic problem where the program output, rather than structure, is used as the basis for determining solution quality is the MAX problem [15], originally introduced in [10]. The problem is that of evolving a program which, given some mathematical operators and constants (the problem admits no variable inputs), outputs the maximum possible value subject to a constraint on program size.
Recently, the time and space complexity of the (1+1) GP has been analyzed for evolving Boolean functions of arity $n$ [25, 21]. Solution quality was evaluated by comparing the output of the evolved programs to the target function on the entire truth table for the target function, or on a polynomially sized training set. The analyses show that while conjunctions of $n$ variables can be evolved efficiently (either exactly, using the complete truth table as the training set, or in the PAC learning framework when using smaller training sets), parity functions of $n$ variables cannot. These results represent the first rigorous complexity analysis of a tree-based GP system for evolving functions with actual input/output behavior.
We will also consider the theoretical work on GSGP, where the variation operators used by the GP system are designed to modify program semantics rather than program syntax.
This chapter presents an overview of the state-of-the-art. It is structured as follows. In Section 2, we introduce the (1+1) GP, the GP system used for most available complexity analysis results. In Section 3, we consider results where the GP system is tasked with evolving tree structures with specific properties (the ORDER, MAJORITY, and SORTING problems). In Section 4, we present results where GP systems evolve programs with limited functionality: the MAX problem in Subsection 4.1 and the identification problem in Subsection 4.2. Section 5 presents results for GP evolving Boolean functions of arity $n$. Section 6 presents a brief overview of the complexity analysis results available for Geometric Semantic Genetic Programming algorithms. Finally, Section 7 presents a summary of the complexity results and discusses the open directions for future work.
In this chapter, we will primarily consider the behavior of the simple (1+1) GP algorithm (Algorithm 1), which represents programs using syntax trees and uses the HVL-Prime operator (Algorithm 2) to perform mutations. The algorithm maintains a population of one individual (initialized either as an empty tree, or a randomly-generated tree of a particular size), and at each generation chooses between the parent and a single offspring generated by HVL-Prime mutation. This simple algorithm was already considered in early comparative work between standard tree-based GP and iterated hill-climbing versions of GP [40, 39, 41].
The HVL-Prime mutation operator, introduced in [7] and shown as Algorithm 2 here, is an updated version of the HVL (Hierarchical Variable Length) mutation operator [39]. It is specialized to deal with binary trees and is designed to perform similarly to bitwise mutation in evolutionary algorithms. The original motivation to use the HVL-Prime operator was that of making the smallest alterations possible to GP trees while respecting the key properties of the GP tree search space: variable length and hierarchical structure.
A single application of HVL-Prime selects uniformly at random one of three sub-operations (insertion, substitution, or deletion) to be applied at a location in the solution tree chosen uniformly at random, selecting additional functions or terminals from the sets $F$ and $L$ of all available functions and terminals as required. The sub-operations are illustrated in Figure 1: substitution can replace any node of the tree with another node chosen uniformly at random from the set of terminals or the set of functions (if the replaced node was a terminal or a function respectively), insertion inserts a new leaf and function node at a random location in the tree, and deletion can remove a random leaf (replacing its parent with its sibling).
Figure 1: The insertion, substitution, and deletion sub-operations of HVL-Prime.
We note that for problems with trivial function or terminal sets (i.e., those that contain only one element), the substitution operator is typically restricted to only select among those nodes which can be replaced with something other than their current content, avoiding the situation where the only option is to substitute a function or terminal node with a copy of itself. This restriction does not typically affect asymptotic complexity analysis results, as the only effect of allowing such substitutions is that approximately $1/3$ of the HVL-Prime applications (those that select substitution) will not alter the current solution.
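The three sub-operations described above can be sketched in Python as follows. This is only an illustrative sketch, not the chapter's algorithm listing: the names `Node`, `nodes`, `leaves`, and `hvl_prime` are hypothetical, the tree is modified in place, and the handling of the empty tree and the choice of which side receives the new leaf on insertion are assumptions made here.

```python
import random

class Node:
    """Binary syntax-tree node; a leaf has no children."""
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right

    def is_leaf(self):
        return self.left is None and self.right is None

def nodes(tree):
    """All nodes in pre-order (empty list for the empty tree)."""
    if tree is None:
        return []
    return [tree] + nodes(tree.left) + nodes(tree.right)

def leaves(tree):
    """Leaf labels in left-to-right (in-order) sequence."""
    return [n.label for n in nodes(tree) if n.is_leaf()]

def hvl_prime(tree, functions, terminals, rng=random):
    """Apply one uniformly chosen sub-operation; returns the new root."""
    op = rng.choice(("insert", "substitute", "delete"))
    if tree is None:
        # Assumption: only insertion can act on the empty tree.
        return Node(rng.choice(terminals)) if op == "insert" else None
    if op == "substitute":
        # Replace a node's label with one of the same kind.
        target = rng.choice(nodes(tree))
        pool = terminals if target.is_leaf() else functions
        target.label = rng.choice(pool)
    elif op == "insert":
        # The chosen node becomes a new function node whose children
        # are the old subtree and a fresh leaf (side chosen at random).
        target = rng.choice(nodes(tree))
        moved = Node(target.label, target.left, target.right)
        fresh = Node(rng.choice(terminals))
        target.label = rng.choice(functions)
        if rng.random() < 0.5:
            target.left, target.right = moved, fresh
        else:
            target.left, target.right = fresh, moved
    else:
        # Delete a random leaf and replace its parent with its sibling.
        all_nodes = nodes(tree)
        leaf = rng.choice([n for n in all_nodes if n.is_leaf()])
        if leaf is tree:
            return None  # assumption: deleting the only leaf empties the tree
        parent = next(n for n in all_nodes if n.left is leaf or n.right is leaf)
        sibling = parent.right if parent.left is leaf else parent.left
        parent.label = sibling.label
        parent.left, parent.right = sibling.left, sibling.right
    return tree
```

Each call changes the number of leaves by at most one, which is the property the tree-size arguments later in the chapter rely on.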
In this chapter, we refer to Algorithm 1 with $k = 1 + \mathrm{Pois}(1)$ as the (1+1) GP, differentiating it from the simpler local search variant which always uses $k = 1$, which we call RLS-GP. (In previous work, the name (1+1) GP was used for both algorithms, relying either on explicitly specifying $k$ or using a suffix like -multi and -single to distinguish between the two variants. Our notation matches the conventions of runtime analysis of evolutionary algorithms [38, 12].)
The (1+1) GP algorithms do not use crossover or populations. Instead, larger changes to the current solution can be performed by multiple applications of the HVL-Prime operator without evaluating the fitness of the intermediate trees produced within an iteration. Since each application of HVL-Prime independently selects the location in the tree it will modify, this procedure can mutate the parent tree in several places, rather than only modifying a single subtree (as would be the case for standard GP’s subtree mutation operator, which replaces a random subtree of the parent program with a randomly-generated subtree [44]).
Algorithm 1 depicts the non-strictly elitist variant of the (1+1) GP, which accepts offspring as long as they do not decrease the fitness of the current solution. We use (1+1) GP* (and equivalently RLS-GP*) to refer to the strictly elitist variant of the algorithm, which only accepts offspring which have strictly better fitness when compared to the current solution.
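The main loop described above can be sketched in Python as follows. The function and parameter names (`one_plus_one_gp`, `poisson1`, `mutate`, `is_optimal`, `local`, `strict`) are hypothetical, and the number of mutations per iteration is assumed here to be $k = 1 + \mathrm{Pois}(1)$ for the (1+1) GP and $k = 1$ for the local search variant. The `mutate` argument stands in for HVL-Prime and is assumed to return a new solution rather than modify its argument.

```python
import math
import random

def poisson1(rng):
    """Sample from a Poisson distribution with mean 1 (Knuth's method)."""
    limit, k, p = math.exp(-1.0), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def one_plus_one_gp(init, mutate, fitness, is_optimal,
                    local=False, strict=False, max_iters=10_000, rng=random):
    """Return (solution, iterations used); stops early at an optimum."""
    x = init
    for t in range(max_iters):
        if is_optimal(x):
            return x, t
        # local=True gives the single-mutation (RLS-style) variant.
        k = 1 if local else 1 + poisson1(rng)
        y = x
        for _ in range(k):
            # Intermediate trees are not evaluated.
            y = mutate(y, rng)
        fy, fx = fitness(y), fitness(x)
        # strict=True rejects offspring of equal fitness (neutral moves).
        if fy > fx or (not strict and fy == fx):
            x = y
    return x, max_iters
```

Passing `strict=True` corresponds to the strictly elitist variant, so the same sketch covers all four combinations discussed in the text.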
The difference between the elitist and non-elitist variants is often significant in how the algorithms cope with bloat problems. The algorithms operate with a variable-length representation of their current solution: as mutations are applied, the number of nodes in the tree may increase or decrease. Poli et al. define bloat as program growth without (significant) return in terms of fitness [44]. Bloat can reduce the effectiveness of GP, as larger programs are potentially more expensive to evaluate, can be hard to interpret, and may reduce the effectiveness of the GP operators in exploring the solution space. For example, if a large portion of the current solution is non-executable (perhaps inside an if statement with a trivially false condition), mutations applied inside this portion of the program would not alter its behavior, and hence are not helpful in attempting to improve the program.
Common techniques used to control the impact of bloat include modifying the genetic operators to produce smaller trees and considering additional non-fitness related factors when determining whether an offspring should be accepted into the population. The latter can include imposing direct limits on the size of the accepted solutions (by imposing either a maximum tree depth or a maximum tree size limit), rejecting neutral solutions, or a parsimony pressure approach [44], which prefers smaller solutions when the fitness of two solutions is equal.
Two bloat control approaches that frequently appear in theoretical analyses of GP algorithms are lexicographic parsimony pressure and Pareto parsimony pressure [23]. The former mechanism breaks ties between equal-fitness individuals (e.g. in line 7 of Algorithm 1) by preferring solutions of smaller size, whereas the latter treats fitness and solution size as equal objectives in a multi-objective approach to optimization, suggesting that the GP system maintains a population of individuals which do not Pareto-dominate each other.
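As a rough illustration of lexicographic parsimony pressure, the acceptance test can be written as a two-level comparison. The function name `accept_lexicographic` and its arguments are hypothetical, and accepting an offspring of equal fitness and equal size is an assumption made here to mirror non-strict selection.

```python
def accept_lexicographic(parent, child, fitness, size):
    """Return True if the child should replace the parent.

    Fitness is compared first; only when fitness is equal does
    the smaller tree win (ties on both are accepted by assumption).
    """
    f_parent, f_child = fitness(parent), fitness(child)
    if f_child != f_parent:
        return f_child > f_parent
    return size(child) <= size(parent)
```

With this rule a neutral insertion (same fitness, larger tree) is rejected while a neutral deletion is accepted, which is how the mechanism counteracts bloat.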
In the GP problems analyzed in this chapter, the correct behavior of the target program is known for all possible inputs. Additionally, in most of the problems, the GP systems considered are able to evaluate program quality on all possible inputs efficiently. Both of these assumptions simplify the analysis, but may not be practical in real-world applications of GP: the correct output of the target function might only be known on a limited number of the possible inputs, and/or it might not be practical to evaluate the candidate solutions on all of the known inputs. Nevertheless, considering the performance of GP in this setting represents an important first step: systems which are unable to evolve a program with the desired behavior using a fitness function which considers all possible inputs are unlikely to fare better when using a limited approximation. Additionally, fully deterministic outcomes for solution fitness comparisons simplify the analysis of the GP systems, allowing their behavior to be described in greater detail.
When the exact fitness is not available, performance of GP is analyzed in the PAC-learning framework [50]. This considers the expected performance of the GP-evolved program on inputs it may not have encountered during the optimization process. In this framework, GP evaluates solution fitness by sampling input/output examples from a training set during the optimization process, while the goal is to produce a program with a low generalization error, i.e., with a good probability of producing correct output on any randomly sampled input, including ones that have not been sampled during its construction. The number of samples used to compare the quality of solutions is an important parameter in this setting, potentially trading evaluation accuracy for time efficiency.
While the GP algorithm may evaluate solution fitness by relying on a static training set of polynomial size, for instance chosen at random from the set of all known inputs/outputs at the start of the optimization process, Poli et al. note that in some circumstances doing so “may encourage the population to evolve into a cul-de-sac where it is dominated by offspring of a single initial program which did well on some fraction of the training cases, but was unable to fit the others” [44, Chapter 10]. To counteract this, GP systems can also, when the amount of training set data available is sufficient, opt to compare program quality on samples chosen from the available data for each comparison [9]. The complexity of these subset selection algorithms varies from simply selecting inputs/outputs at random (in the case of Random Subset Selection), to attempting to identify useful inputs/outputs based on the current or previous GP runs (Dynamic or Historical Subset Selection respectively), to hierarchical combinations of these approaches [4].
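The simplest of these schemes, drawing a fresh random subset for each comparison, might look like the sketch below. The names `sampled_errors` and `prefers_offspring` are hypothetical, programs are modeled as plain Python callables, and the training set is assumed to be a list of `(input, output)` pairs.

```python
import random

def sampled_errors(program, examples):
    """Number of examples on which the program's output is wrong."""
    return sum(1 for x, y in examples if program(x) != y)

def prefers_offspring(parent, offspring, training_set, sample_size, rng=random):
    """Compare two programs on a freshly drawn random subset.

    Returns True if the offspring makes no more errors than the
    parent on the sampled examples (non-strict acceptance).
    """
    sample = rng.sample(training_set, sample_size)
    return sampled_errors(offspring, sample) <= sampled_errors(parent, sample)
```

Small samples make each comparison cheap but noisy: a worse program can look equally good on a subset that happens to miss the inputs it gets wrong, which is the accuracy/time trade-off mentioned above.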
In this section, we review the computational complexity results concerning the analysis of GP systems for the evolution of trees with specified properties, rather than the result of running the evolved program on any particular input. The specific property that the evolved tree should satisfy depends on the problem class. The possibility of calculating the fitness of candidate solution trees without explicitly executing the program was regarded as a considerable advantage since more realistic problems were deemed to be far too difficult for initial computational complexity analyses.
The earliest analysis for the evolution of tree structures considered two separable problems called ORDER and MAJORITY. The problems, originally introduced by Goldberg and O’Reilly [11], were considered as “two much simplified, but still insightful, problems that exhibit a few simple aspects of program structure” [7]. They were minimally sufficient to capture relevant GP properties such as the existence of multiple optimal solutions. Specifically, ORDER and MAJORITY were respectively introduced as abstracted simplifications of the eliminative expression that takes place in conditional statements (where the presence or absence of some element may eliminate others from evaluation, e.g. by making it impossible for program execution to reach the body of an if statement with an always false condition) and the accumulative expression present in many GP applications such as symbolic regression (where the GP system is able to accumulate information about the correct solution from the aggregate response of a large number of variables). In particular, the ORDER problem was meant to reflect conditional programs by making it impossible to express certain variables by inserting them at certain tree locations (representing portions of the program which might not ever be executed), while MAJORITY requires the identification of the correct set of solution components out of all possible sets. For both problems the fitness of a candidate solution is determined by an in-order traversal of its syntax tree.
Another problem considered in the literature where the fitness of solutions depends on tree structure rather than program execution is SORTING. In the following three subsections, we review the state of the art concerning these problems.
The ORDER problem, as originally introduced by Goldberg and O’Reilly [11], is defined as follows.
$F := \{J\}$, $L := \{x_1, \bar{x}_1, \ldots, x_n, \bar{x}_n\}$.
The fitness of a tree $X$ is the number of literals $x_i$ for which the positive literal $x_i$ appears before the negative literal $\bar{x}_i$ in the in-order parse of $X$.
In this problem, $J$ (for “join”) is the only available function, and the fitness of a tree is determined by an in-order parse of its leaf nodes; this reduces the importance of the tree structure in the analysis, making the representation somewhat similar to a variable-length list. For example, a tree $X$ with in-order parse $(x_1, \bar{x}_4, x_2, \bar{x}_1, x_3, \bar{x}_6)$ has fitness $f(X) = 3$ because the first occurrences of $x_1$, $x_2$, and $x_3$ precede any occurrence of their negations. The optimal solution is any tree that contains all the positive literals $x_i$ and in which each negative literal $\bar{x}_i$ that appears in the tree is preceded by the corresponding positive literal $x_i$; such a tree has a fitness of $n$.
ORDER was introduced as a simple problem that reflects the typical eliminative expressions that take place in conditional statements and other logical elements of computer programs, where the presence of an element determines the execution of a program branch rather than another. The overall idea is that, in ORDER, the conditional execution path is determined by inspecting whether a literal or its complement appear first in the in-order leaf parse. The task of the GP algorithm is to identify and appropriately position the conditional functions to achieve the correct behavior.
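A small Python sketch of the ORDER fitness function may make the definition concrete. It works directly on the in-order leaf sequence, and the encoding is an assumption made for this example: the integer `i` stands for the positive literal of variable $i$ and `-i` for its negation. The function name `order_fitness` is hypothetical.

```python
def order_fitness(parse):
    """Count variables whose positive literal occurs first.

    `parse` is the in-order sequence of leaf literals, with +i for
    the positive literal of variable i and -i for its negation.
    """
    seen = set()       # variables already encountered in the parse
    expressed = set()  # variables whose first occurrence was positive
    for literal in parse:
        variable = abs(literal)
        if variable not in seen:
            seen.add(variable)
            if literal > 0:
                expressed.add(variable)
    return len(expressed)

# Variables 1, 2 and 3 are expressed; 4 and 6 occur only negated.
print(order_fitness([1, -4, 2, -1, 3, -6]))  # prints 3
```

Only the first occurrence of each variable matters, so any later insertion of a literal behind that first occurrence is fitness-neutral; this is the source of the neutral insertions discussed in the analysis of tree growth.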
Durrett et al. proved that the (1+1) GP can optimize ORDER in expected time $O(n T_{\max})$, where $T_{\max}$ represents the maximum size the evolved tree reaches throughout the optimization process. The exact result is stated in the following theorem.
Theorem 2. The expected optimization time of the strictly and non-strictly elitist cases of the RLS-GP and (1+1) GP algorithms on ORDER is $O(n T_{\max})$ in the worst case, where $n$ is the number of variables and $T_{\max}$ denotes the maximal tree size at any stage during the execution of the algorithm.
The proof idea uses standard fitness-based partition arguments. Given that at most $k$ variables are expressed correctly (i.e., the positive literal appears before any instances of the corresponding negative literal in the in-order parse of the GP tree), a lower bound of $p_k = \Omega\left(\frac{(n-k)^2}{n \max\{T, n\}}\right)$ may be achieved on the probability of expressing an additional literal by an insertion operation given that the GP tree contains exactly $T$ leaves. Then by standard waiting time arguments the expected number of iterations to improve the solution is $O(1/p_k)$, and the expected time until all literals are expressed is obtained by summing these waiting times over $k$.
The runtime bound stated in Theorem 2 depends on the tree size $T_{\max}$. If, as often happens in GP applications, a bound on the maximum size of the tree is imposed, then this bound is also a bound on $T_{\max}$. However, if no restriction on the maximum tree size is imposed, then bounding the maximum size of the tree is challenging. Nevertheless, if strict selection and local mutations are used, then it can be shown that the tree does not grow too much from its initialized size. The following corollary of Theorem 2, which states this result precisely, is slightly more general than the one presented in [7].
Corollary 3. The expected optimization time of the strictly elitist RLS-GP* on ORDER is $O(n^2 + n T_{\mathrm{init}})$ if the tree is initialized with $T_{\mathrm{init}}$ terminals.
RLS-GP* will only accept mutations which improve the fitness of the current solution, and as there are only $n + 1$ possible fitness values, at most $n$ mutations can be accepted by the GP before the optimum is found.
A single application of HVL-Prime cannot increase the size of the tree by more than one leaf. Thus, $T_{\max} \le T_{\mathrm{init}} + n$, and applying Theorem 2 yields the desired runtime bound. ∎
It is still an open problem to bound $T_{\max}$ for the (1+1) GP, or even for RLS-GP where non-strict selection is used. It has been conjectured [7] that the same bound as in Corollary 3 should also hold for the (1+1) GP. In general, the authors note that the acceptance of neutral moves on ORDER causes a “feedback loop that stimulates the growth of the tree”, as there is a slight bias towards accepting insertions rather than deletions on the problem, and larger trees create more opportunities for neutral insertions to take place.
A subsequent experimental analysis performed by Urli et al. led the authors to conjecture an $O(T_{\mathrm{init}} + n \log n)$ upper bound on the runtime [49], which would imply, if correct, that the bound given in Corollary 3 is not tight.
As shown in the following subsection, by using bloat control mechanisms, more precise results have been achieved by exploiting the more explicit control of the tree size.
The performance of the (1+1) GP with lexicographic parsimony pressure on ORDER has been considered by Nguyen et al. [35] and Doerr et al. [5]. This mechanism controls bloat by preferring trees of smaller size when breaking ties amongst solutions of equal fitness.
The Negative Drift Theorem was used by Nguyen et al. to show that as long as the initial tree is not too large (), it does not grow significantly in less than exponential time (i.e., with high probability). With this bound on , it is then proven that the optimum is found in iterations with high probability, showing that the solution can be improved up to times via a cycle of shrinking it down to minimal size (containing no redundant copies of any variable) and then expressing a new variable (pessimistically assuming that this insertion also creates a large amount of redundant terminals in the tree, requiring another round of shrinking to occur prior to the next insertion). Experimental results led to the conjecture of an bound [49].
A more precise analysis proves the bound and its tightness, as given in the following theorem [5].
Theorem 4. The (1+1) GP with lexicographic parsimony pressure on ORDER takes $\Theta(T_{\mathrm{init}} + n \log n)$ iterations in expectation to construct the minimal optimal solution.
The lower bound of the theorem is proven by using standard coupon collector and additive drift arguments. For the upper bound, the variable drift theorem [46] is applied using a potential function that takes into account both the number of expressed literals and the size of the tree.
Neumann considered the Pareto parsimony pressure approach to bloat control by introducing a multi-objective GP algorithm (SMO-GP), and using both the solution fitness and size as objectives [34]. This approach was motivated by noting that GP practitioners can, when presented with a variety of solutions, gain insight into how solution complexity trades off against quality.
The SMO-GP algorithm maintains a population $P$ of solutions representing the current-best approximation of the Pareto front. Similarly to the (1+1) GP, the algorithm produces a single offspring individual in each iteration by applying the HVL-Prime operator $k$ times to a parent individual chosen uniformly at random from $P$. If the offspring is not strictly dominated by any solution already in $P$, it is added to the population, while any solutions in $P$ that it weakly dominates are removed. Thus, the size of the population can vary throughout the run. The theoretical analysis considers the number of iterations required to compute a population containing the entire Pareto front.
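The population update rule described above can be sketched as follows, assuming fitness is to be maximized and tree size minimized. The function name `update_population` and the `fitness`/`size` callables are hypothetical; this covers only the acceptance step, not parent selection or mutation.

```python
def update_population(population, child, fitness, size):
    """Return the population after offering it one offspring.

    The child is rejected if some member strictly dominates it;
    otherwise it is added and every member it weakly dominates
    (fitness no higher, size no smaller) is removed.
    """
    f_child, s_child = fitness(child), size(child)
    for member in population:
        f_m, s_m = fitness(member), size(member)
        no_worse = f_m >= f_child and s_m <= s_child
        better_somewhere = f_m > f_child or s_m < s_child
        if no_worse and better_somewhere:
            return population  # child is strictly dominated
    survivors = [m for m in population
                 if not (f_child >= fitness(m) and s_child <= size(m))]
    return survivors + [child]
```

Because a child with the same fitness and size as an existing member weakly dominates it, the member is replaced, so the population never holds two solutions with identical objective values.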
Theorem 5. The expected optimization time of the SMO-GP, using either $k = 1$ or $k = 1 + \mathrm{Pois}(1)$, on ORDER is $O(n T_{\mathrm{init}} + n^2 \log n)$.
The result is proven by showing that it is possible for the GP algorithms to construct the empty tree in expected iterations. Once a minimal solution with expressed variables exists in the population, the minimal solution with expressed variables can be constructed from it with probability at least in each iteration, and hence an upper bound on the expected runtime may be achieved by using the fitness-based partition method.
Neumann introduced the WORDER problem, a weighted variant of ORDER, where each pair of variables $x_i, \bar{x}_i$ has a corresponding weight $w_i$ and the fitness of a solution is the sum of the weights of all expressed variables [34]. The idea behind the problem was to mimic the generalization of the complexity analysis of evolutionary algorithms from OneMax to the class of linear pseudo-Boolean functions [6, 38]. As RLS-GP is unable to produce offspring that differ by more than two expressed variables, its expected optimization time on WORDER is equal to its expected optimization time on ORDER, leading to an equivalent of Theorem 4 for the RLS-GP with lexicographic parsimony pressure. A bound on the runtime of the (1+1) GP is given in the following theorem.
The expected optimization time of the (1+1) GP on WORDER is $O(n T_{\max} (\log n + \log w_{\max}))$.
The theorem is proven by applying the Multiplicative Drift Theorem, showing that, in expectation, the weight of the unexpressed variables decreases by a constant factor in each iteration. The dependence on $T_{\max}$ is explained by noting that a beneficial mutation requires the insertion of a variable at the beginning of the in-order walk of the tree; unfortunately, $T_{\max}$ can potentially grow to be arbitrarily large.
As for the standard ORDER problem, results that do not depend on the maximum tree size may be achieved by using the Pareto parsimony pressure approach. Yet, in the multi-objective setting, a special case is considered where the algorithm is initialized with a non-redundant solution (i.e., a solution where no single leaf can be removed without adversely affecting solution fitness). By limiting $k = 1$ (i.e., the RLS variant), the algorithm will not accept redundant solutions throughout the optimization process. The following theorem was proven following the approach of Theorem 5.
Starting with an initial solution containing no redundant terminals, the expected optimization time of SMO-GP with on WORDER is .
Both the requirement that the initial tree should be non-redundant, and the restriction to the single-operation local search variant of SMO-GP were removed by Nguyen et al. [35].
Let be the size of the initial solution, and be the maximum size of the SMO-GP population at any point during the optimization process. Then, expected optimization time of SMO-GP on WORDER is when , and when .
Unfortunately, even though the size of the Pareto front is linear, is not a parameter that can be controlled by the user: in the worst case, the population might consist of an individual for every possible fitness value, and on some WORDER instances, this can range up to . Experiments have led to the conjecture that both and grow linearly during the optimization process. However, no rigorous proofs are available [35].
The MAJORITY problem, as originally introduced by Goldberg and O’Reilly [11], is defined as follows.
, .
The fitness of a tree is the number of literals for which the positive literal appears in at least once, and at least as many times as the corresponding negative literal .
In this problem, (for “join”) is the only available function, and the fitness of a tree is determined by an in-order parse of its leaf nodes; this reduces the importance of the tree structure in the analysis, making the representation somewhat similar to a variable-length list. For example, a tree with an in-order parse of would have a fitness of , as only the and literals are expressed (while outnumbers in the tree, and is therefore suppressed). Any optimal solution, expressing all positive literals, has a fitness of .
The fitness of solutions on MAJORITY is based on the quantity of and literals in the tree, with only the literal in greater quantity (majority) being expressed and potentially contributing to the fitness value. This serves to model problems where solution fitness can be accumulated through additions of more nodes to the tree, regardless of their exact positions.
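As a minimal sketch of this fitness function, the following Python snippet scores the in-order leaf sequence of a tree. The representation of a literal as an `(index, is_positive)` pair, the function name `majority_fitness`, and the example leaf sequence are assumptions made for illustration; they are not taken from the original problem definition.

```python
from collections import Counter


def majority_fitness(leaves):
    """Count the variables expressed by an in-order leaf sequence.

    `leaves` is a list of (index, is_positive) pairs (hypothetical encoding).
    A variable is expressed if its positive literal appears at least once
    and at least as often as its negative literal.
    """
    positives = Counter(i for i, is_positive in leaves if is_positive)
    negatives = Counter(i for i, is_positive in leaves if not is_positive)
    # Counter returns 0 for indices that never appear negated.
    return sum(1 for i, count in positives.items() if count >= negatives[i])


# Illustrative leaf sequence: x1, not-x2, x2, not-x2, x3.
# x1 and x3 are expressed; x2 is outnumbered by its negation.
example = [(1, True), (2, False), (2, True), (2, False), (3, True)]
print(majority_fitness(example))  # 2
```

Because only the counts of each literal matter, the position of a leaf in the tree plays no role in this sketch, which matches the remark above that fitness can be accumulated regardless of node positions.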
In contrast to ORDER, where there is always a position in the tree where an unexpressed literal can be inserted to express and improve the fitness of a solution, on MAJORITY there exist trees where no single insertion of an unexpressed will lead to being expressed and thus improving fitness, even though all literals can contribute to expressing in aggregate regardless of their position. Thus, GP variants which do not accept neutral moves were found to perform quite badly, with RLS-GP shown to be capable of getting stuck in easily-constructed local optima, and having an exponential expected optimization time to recover from a worst-case initialization [7]. On the other hand, GP variants using non-strict selection may be efficient.
Let denote the maximal tree size at any stage during the execution of the algorithm. Then, the expected optimization time of the RLS-GP on MAJORITY is
in the worst case, where , and is the number of times the literal appears in the initial tree.
If initialized with a random tree containing terminals selected uniformly at random from , the expected optimization time of the RLS-GP on MAJORITY is
The presented bounds depend on , the maximum deficit between the number of positive literals and negative literals of any variable in the tree (thus, a tree with a single copy of and two copies of would have a deficit of ). The worst case result, assuming a deficit of literals for all variables, follows from a generalized variant of the coupon collector problem [33], requiring the collection of copies of each coupon. For the uniform initialization with , a bound on is derived using the balls-into-bins model [28]. It is then proven that a variable which initially has a deficit of becomes expressed after an expected mutations involving that variable (which occur with probability ) by showing that the GP system essentially performs a random walk that is at least fair with respect to decreasing the deficit.
For the , only a hypothetical worst-case analysis for the elitist variant is presented in [7], noting that if the last unexpressed variable has more negative literals than positive literals in the tree, the final mutation will require at least time, and thus unless can be shown to be constant, the expected runtime remains super-polynomial. However, no bounds on the probability that a super-constant would actually occur were given.
The problem, including the dependence on was recently solved, proving the following upper and lower bounds on the expected optimization time [5].
When initialized with a tree containing terminals, the expected optimization time of the RLS-GP and algorithms on MAJORITY is at least and at most
The lower bound is proved by an application of the multiplicative drift theorem with bounded step size, while the upper bound relies on showing that if , the tree will grow by at most a constant factor in generations before the optimal solution is constructed. As a result, bloat cannot be excessive during the optimization process: the final tree is larger than the optimal solution by at most a multiplicative polylogarithmic factor.
From the analysis, an interesting alternative to bloat control emerges. By changing the HVL mutation probabilities such that deletions are more likely than insertions, a drift towards smaller solutions would be observed, leading to smaller trees, and hence faster optimization. Such a suggestion was originally given by Durrett et al., albeit for the ORDER problem [7]. Concerning MAJORITY, theoretical evidence in support of this has emerged, though no formal proof is available [5].
Applying lexicographic parsimony pressure mitigates the analysis problems in the GP systems for MAJORITY. With this bloat control mechanism, mutations which solely remove negated terminals are always accepted, as they reduce the size of the tree. Accepting such mutations eventually leads GP to a solution where fitness can be improved by inserting a positive literal, allowing the optimum to be reached efficiently.
The expected optimization time of the RLS-GP with lexicographic parsimony pressure on MAJORITY, when initialized with a tree containing literals, is
The result is proven by reasoning that it takes iterations to remove the negated terminals provided by a worst-case initialization, and iterations to express all variables by an application of the coupon collector argument.
A tight bound for the (1+1) GP, showing that larger Poisson mutations do not affect the asymptotic runtime, has recently been proved [5], confirming a previous conjecture [49].
The expected optimization time of the with lexicographic parsimony pressure on MAJORITY, when initialized with a tree containing literals, is .
The lower bound of the theorem is proven by using standard coupon collector and additive drift arguments. For the upper bound, the variable drift theorem [46] is applied using a potential function that takes into account both the number of expressed literals and the size of the tree. Intuitively, the size of the tree is only allowed to increase if the MAJORITY fitness is also increased, which can only occur a limited number of times, and the magnitude of the increase is unlikely to be overly large due to the Poisson distribution used to determine .
It is still an open problem to prove whether lexicographic parsimony pressure asymptotically improves the runtime of the (1+1) GP or whether the upper bound given in Theorem 11 is not tight (Urli et al. conjecture an upper bound of without bloat control based on experimental data [49]).
Applying Pareto parsimony pressure and treating the size of the tree as an additional objective in the multi-objective SMO-GP algorithm allows the GP system to compute the Pareto front of solutions in terms of fitness/complexity.
The expected optimization time of SMO-GP (with either and ) on MAJORITY, initialized with a single tree containing terminals, is .
The SMO-GP population will contain at most individuals, as there are only distinct fitness values on MAJORITY. Similar to the situation for lexicographic parsimony pressure, SMO-GP is able to construct an initial solution on the Pareto front by repeatedly removing any duplicate or negated terminals from the initial solution. Once a solution on the Pareto front exists, the entire front can be constructed by repeatedly selecting a solution at the edge of the front and expressing an additional variable or deleting an expressed variable.
Neumann [34] introduced the WMAJORITY problem, a weighted variant of MAJORITY, where each pair of variables has a corresponding weight and the fitness of a solution is the sum of the weights of all expressed variables. The idea was again to mimic the generalization of the complexity analysis of evolutionary algorithms from OneMax to the class of linear pseudo-Boolean functions [6, 38]. Results about GP systems without bloat control for WMAJORITY are unknown, though Urli et al. conjecture an upper bound on the runtime of RLS-GP and based on experimental results [49].
Concerning lexicographic parsimony pressure, as RLS-GP is unable to produce offspring that differ by more than two expressed variables, its expected optimization time on WMAJORITY is equal to its expected optimization time on MAJORITY, leading to an equivalent of Theorem 12 (i.e., an runtime bound).
Concerning Pareto parsimony pressure, bounds on the WMAJORITY problem for both the single-operation and multi-operation variants of the SMO-GP were proven, using the size of the tree as an additional objective to minimize [35].
Let be the size of the initial solution, and be the maximum size of the SMO-GP population at any point during the optimization process. The expected optimization time of SMO-GP on WMAJORITY is when , and when .
Unfortunately, is not a parameter that can be controlled by the user: in the worst case, the population might consist of an individual for every possible fitness value, which for WMAJORITY can be exponential with respect to . Experiments have led to the conjecture that grows linearly with the problem size [35].
The SORTING problem is the first classical combinatorial optimization problem for which computational complexity results have been obtained for discrete evolutionary algorithms. For the application of evolutionary algorithms, Scharnow et al. defined SORTING as the problem of maximizing different measures of sortedness of a permutation of a totally ordered set of elements [47].
The problem was considered in a GP setting by Wagner et al., aiming to investigate the differences between different bloat control mechanisms for genetic programming [52, 53]. For the GP variant, the measures of sortedness have been adapted to deal with incomplete permutations of the literal set.
, . The fitness of a tree is computed by deriving a sequence of symbols based on their first appearance in the in-order parse of , and considering one of the five measures of sortedness of this sequence:
Measure | Definition |
---|---|
INV | Number of pairs of adjacent elements in the correct order (maximize to sort), with INV if . |
HAM | Number of elements in correct position (maximize to sort). |
RUN | Number of maximal sorted blocks (minimize to sort), plus the number of missing elements , with RUN if . |
LAS | Length of longest ascending sequence (maximize to sort). |
EXC | Smallest number of exchanges needed to sort the sequence (minimize to sort), plus if . |
In this problem, (for “join”) is the only available function, and the fitness of a tree is determined by an in-order parse of its leaf nodes drawn from a totally-ordered set of terminals ; this reduces the importance of the tree structure in the analysis, making the representation somewhat similar to a variable-length list. Thus, for , the fitness of a tree with an in-order parse of , and hence is: INV, HAM, RUN, LAS, and EXC. The fitness value of optimal trees for the INV, HAM, and LAS measures is , while for the RUN and EXC measures it is .
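The derivation of the evaluated sequence can be sketched in Python as follows, together with two of the five measures. This is an illustration under stated assumptions rather than the definition used in [52, 53]: terminals are modeled as the integers 1 to n, HAM counts an element as correctly placed when it equals its 1-indexed position in the derived sequence, and the function names are hypothetical. The remaining measures are omitted because their adjustments for missing elements are not reproduced here.

```python
from bisect import bisect_left


def first_appearance_sequence(leaves):
    """Keep each terminal at the position of its first in-order appearance."""
    seen = set()
    sequence = []
    for terminal in leaves:
        if terminal not in seen:
            seen.add(terminal)
            sequence.append(terminal)
    return sequence


def ham(sequence):
    """Elements sitting at their own 1-indexed position (assumed convention)."""
    return sum(1 for position, element in enumerate(sequence, start=1)
               if element == position)


def las(sequence):
    """Length of the longest strictly ascending subsequence (patience sorting)."""
    tails = []  # tails[k] = smallest tail of an ascending subsequence of length k+1
    for element in sequence:
        k = bisect_left(tails, element)
        if k == len(tails):
            tails.append(element)
        else:
            tails[k] = element
    return len(tails)


# Illustrative in-order parse with a repeated terminal.
sequence = first_appearance_sequence([1, 3, 1, 2, 4])
print(sequence)       # [1, 3, 2, 4]
print(ham(sequence))  # 2
print(las(sequence))  # 3
```

Duplicated terminals never change the derived sequence, which is why redundant copies of a literal can accumulate in the tree without affecting the sortedness measure.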
Unlike the ORDER and MAJORITY problems considered in the previous sections, the SORTING problem is not separable, meaning that it cannot be split into subproblems that could be solved independently. The dependencies between the sub-problems can thus significantly impact the overall time needed to solve the optimization problem, and the variable-length representation of solutions can create local optima from which it is difficult for the GP systems to escape. Wagner et al. additionally remark that the task of evolving a solution is more difficult for the considered RLS-GP and systems than for the permutation-based EA, which in expectation requires iterations for the INV, HAM, LAS or EXC sortedness measures, and exponential time when using the RUN sortedness measure [47].
The expected optimization time for the RLS-GP and algorithms on SORTING using INV as the sortedness measure is , where is the number of elements to be sorted, and is the maximum size of the tree during the run of the algorithm.
For the HAM, RUN, LAS and EXC measures, there exist initial solutions with terminals such that the expected optimization time of RLS-GP is infinite, and the expected optimization time of is .
The positive statement is proven by applying the artificial fitness level method, observing that there are possible fitness values, and with probability a mutation inserts a literal which corrects at least one unsorted pair without introducing any additional unsorted pairs.
For the HAM, RUN, LAS, and EXC measures, trees which require large mutations to improve fitness exist, which causes the expected optimization time to be infinite for the RLS-GP, and for the . In general, the problematic solutions contain a large number of copies of a single literal in an incorrect location, and a large sorted sequence, requiring either all the incorrectly placed copies to be removed simultaneously, or the sorted sequence to be moved in a single mutation.
When bloat control mechanisms are applied, the GP systems may reduce the size of the redundant components of the solution even if mutations which make progress in this direction do not alter the solution’s sortedness measure.
The impact of applying lexicographic parsimony pressure for the family of algorithms, and of Pareto parsimony pressure for the SMO-GP algorithms has been considered [52, 53]. We summarize the results in Table 1.
No Bloat Control | Parsimony Pressure | |||
---|---|---|---|---|
F(X) | RLS-GP | RLS-GP | SMO-GP | |
INV | [2] | [2] | [2] | [2] |
LAS | [2] | [2] | [2] | [2] |
HAM | [2] | [2] | [1] | [1] |
EXC | [2] | [2] | [1] | [1] |
RUN | [2] | [2] | [1] | [1] |
In general, the positive results are proven by showing that there exists a sequence of fitness-improving mutations leading the GP system to the global optimum (in the case of algorithms), or, for the SMO-GP, to a solution on the Pareto front, from which other Pareto front solutions can be constructed efficiently.
The majority of the negative results rely on showing the existence of local optima for the sortedness measure, which limits the availability of results for the non-strictly elitist algorithms, and especially for the (1+1) GP, which is capable of performing larger mutations.
Interestingly, the results in Table 1 suggest that the variable-length representation can cause difficulties for the RLS-GP even when parsimony pressure is applied for some simple measures of sortedness, while even a simple multi-objective algorithm is able to find the entire Pareto front of the problem efficiently when using any of the five considered measures.
Experimental results presented suggest that the algorithm is efficient (i.e. able to find the optimum in polynomial time) using all of the considered sortedness measures except RUN, both with and without bloat control mechanisms: concerning the average case complexity, an bound is conjectured for INV and LAS measures, and an bound is conjectured for the EXC and HAM measures [53]. Providing a rigorous theoretical analysis of the GP systems’ behavior remains an open question.
In this section, we consider two applications that are more advanced than those of the previous section. For both problems, the fitness of an evolved program is computed by evaluating its output. While more realistic, the problems are still different from proper GP applications. In the first problem, MAX, the program to be evolved has no input variables, and thus the GP system has to construct a program which always outputs the same constant value, subject to constraints on problem size and available operators. In the second problem, an identification problem, the structure of the optimal solution is fixed (i.e., no tree structure has to be evolved), and the considered GP system is not allowed to deviate from it. Instead, it must learn the exact weights of a predefined linear function while evaluating program quality by comparing the program output to the target function on only a limited number of the possible function inputs.
The MAX problem was originally introduced by Gathercole and Ross as a means of analyzing the limitations of crossover when applied to trees of fixed size [10]. The fitness of the program depends on the evaluation of the arithmetic expression represented by the tree. However, the problem contains no variable inputs, and thus the goal of the GP algorithms is simply to construct a tree that evaluates to the maximum possible value subject to the restrictions on the size of the tree, and the available functions and terminals.
, , a positive constant, and maximum tree depth .
The fitness of a tree is the value produced by evaluating the arithmetic expression represented by the tree if the tree is of depth at most , and 0 if the tree is of larger depth.
The optimal solution to MAX is a complete binary tree of depth , with at all the leaves, with the lowest levels of interior (i.e., non-leaf) nodes containing and the remaining interior nodes containing . It has been noted that lower values of make the problem more difficult for crossover-based GP systems [10].
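The fitness evaluation described above can be sketched as follows. The sketch assumes that the two available functions are addition and multiplication (written `"+"` and `"*"`), that a tree is either the string `"t"` or a tuple `(op, left, right)`, and that a single leaf has depth 0; these representation choices and the function names are illustrative assumptions.

```python
def depth(tree):
    """Depth of a tree; a lone leaf is assumed to have depth 0."""
    if tree == "t":
        return 0
    _, left, right = tree
    return 1 + max(depth(left), depth(right))


def value(tree, t):
    """Evaluate the arithmetic expression with every leaf set to the constant t."""
    if tree == "t":
        return t
    op, left, right = tree
    a, b = value(left, t), value(right, t)
    return a + b if op == "+" else a * b


def max_fitness(tree, t, max_depth):
    """Value of the expression, or 0 if the depth limit is exceeded."""
    return value(tree, t) if depth(tree) <= max_depth else 0


# A complete tree of depth 2: additions at the lowest interior level,
# a multiplication at the root.
tree = ("*", ("+", "t", "t"), ("+", "t", "t"))
print(max_fitness(tree, t=2, max_depth=2))  # 16
print(max_fitness(tree, t=2, max_depth=1))  # 0, the tree is too deep
```

The hard cut-off to 0 is what makes insertions near the depth limit risky for the GP system: a single insertion that exceeds the limit discards all accumulated value.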
The behavior of GP systems on the MAX problem was previously studied experimentally. Langdon and Poli observed that MAX is hard for GP systems utilizing crossover: the interaction of deception with the depth bound on the tree makes it difficult to evolve solutions. The GP systems are essentially forced to perform randomized hill climbing in the later stages of the optimization process, and hence require exponential time with respect to the maximum allowed depth of the trees [19].
A theoretical analysis of the on the MAX problem was presented by Kötzing et al., showing that the runtime of the mutation-only algorithm was polynomial with respect to , the maximum allowed number of nodes in the tree.
The RLS-GP algorithm finds the optimal solution on the MAX problem for any choice of , in expected iterations, where is the maximum allowed number of nodes in a tree subject to the depth limit .
The theorem is proven by showing that the GP algorithm can first construct a complete binary tree with depth in a way that prevents any node from being deleted, and then use the substitution sub-operation of HVL-Prime to correct internal nodes.
Concerning the , a weaker bound on the expected runtime was proven.
The expected time for the to find the optimal solution for the MAX problem with is .
The theorem is proved using fitness-based partitions, exploiting the existence of at least one leaf in a tree of size which could be selected by insertion to grow the tree. Experimental results were also presented, suggesting that the true runtime of the (1+1) GP on MAX is also , and the authors note that a more precise potential function based on the contents of the tree would be required to show this upper bound using drift analysis.
Additionally, a modification of the insertion operation in HVL-Prime to grow the tree in a more balanced fashion was considered: rather than selecting the location at which to insert a new leaf node uniformly at random from the entire tree, the operator picks a leaf at depth with probability and replaces it with a new function node, using the original leaf and an inserted terminal as its children. As well as balancing the growth of the tree between different branches, this reduces the probability that mutation attempts insertion operations which would be blocked by the tree depth limit. With this modified insertion operator, an bound on the expected runtime of the (1+1) GP on MAX with was proved [15].
Closing the gap between the upper bound for the on MAX with and an lower bound given by a coupon collector argument remains an open problem. Furthermore, theoretical time complexity analyses of the performance of crossover-based GP systems, for which the MAX problem was originally introduced, are still unavailable.
The identification problem was introduced by Kötzing et al. [14] to evaluate the learning capabilities of a simple evolutionary algorithm (EA with a local mutation operator) in the setting of the probably approximately correct (PAC) learning framework [50]. The idea is that while some problems cannot always be solved exactly (as there might be no known polynomial-time algorithms producing an exact solution, as, e.g., for NP-hard problems), a good approximation, i.e. one that is correct on a random input with high probability, may be a satisfactory solution. A large class of functions has been shown to be PAC learnable by designing appropriate evolutionary algorithms [51, 8]. Compared to these works, Kötzing et al. consider a simplified setting [14]. Unlike the previously considered problems, the structure of the desired solution is known in advance by the algorithm, which is tasked with identifying the target function from a known class of linear functions; more precisely, the identification problem is that of learning a linear function defined over bit strings ,
where .
The goal of the considered EA (called the Linear GP algorithm) is to identify whether each weight is positive or negative. The algorithm changes a single weight in each iteration, and determines whether the mutated offspring has better fitness than its parent using a multi-set constructed independently in each iteration by selecting the desired number of points uniformly at random (with replacement) from , and computing an error of each solution as :
preferring solutions with lower error.
Thus, the focus of the analysis is to measure the ability of the GP system to extract information from a limited view of the true fitness function: if is too small, the sampled error function may be an unreliable indication of the true quality of the solution. On the other hand, larger require more computational effort for each fitness evaluation, which could result in worse performance with respect to the overall CPU time spent.
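One iteration of such a sampling-based comparison might look as follows. This is only a sketch under several assumptions that the text above does not fix: weights are taken from {-1, +1}, a mutation flips the sign of one uniformly chosen weight, the sampled error is the mean absolute difference between the outputs of the candidate and the target, and the offspring is accepted only if its sampled error is strictly lower. All function names are hypothetical.

```python
import random


def linear(weights, x):
    """Output of the linear function with the given weights on bit string x."""
    return sum(w * xi for w, xi in zip(weights, x))


def sampled_error(candidate, target, sample):
    """Mean absolute output difference on the sample (assumed error measure)."""
    total = sum(abs(linear(candidate, x) - linear(target, x)) for x in sample)
    return total / len(sample)


def linear_gp_step(parent, target, sample_size, rng):
    """One iteration: draw a fresh sample, flip one weight, keep the better."""
    n = len(parent)
    # Fresh multiset of inputs, drawn uniformly at random with replacement.
    sample = [[rng.randint(0, 1) for _ in range(n)] for _ in range(sample_size)]
    child = list(parent)
    i = rng.randrange(n)
    child[i] = -child[i]
    # Tie handling (strict improvement required) is an assumption.
    if sampled_error(child, target, sample) < sampled_error(parent, target, sample):
        return child
    return parent


rng = random.Random(1)
target = [1, -1, 1, -1]
current = [1, 1, 1, 1]
for _ in range(50):
    current = linear_gp_step(current, target, sample_size=20, rng=rng)
print(current)
```

With a small `sample_size`, the two sampled errors are noisy estimates, so a harmful flip can occasionally look beneficial; increasing `sample_size` reduces this risk at the cost of more function evaluations per iteration.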
The following theorem shows that the Linear GP algorithm is able to learn efficiently if the number of inputs sampled in each iteration is sufficiently large.
If , a large enough constant, the expected number of generations until the best-so-far function found by Linear GP has an expected error is .
If also has a linear number of both and weights, the expected number of generations until such a solution is found is .
In this setting, implies that an optimal solution has been found, and thus the theorem additionally provides an bound on the expected number of generations required to learn the target function perfectly (by setting ). The theorem is proven by showing that in generations, the numbers and of incorrect weights in set to and respectively become balanced (such that there is at most one more incorrect weight of one kind versus the other) with high probability, and remain balanced throughout the rest of the process. When , mutations increasing either value are rejected with high probability, while mutations reducing either value are accepted with high probability (but can be undone by the GP system until a wrong weight of the opposite kind is corrected). Thus, and can be reduced permanently by performing the two reductions in sequence (which occurs with probability at least if initially ), and by a coupon collector-like argument, the number of incorrect weights is reduced to an acceptable level in expectation after generations.
Extending the analysis results to broader function classes and algorithms, e.g. considering functions with more than two options for each coefficient, or a (1+1) GP-like mutation operator capable of performing more than one change in each iteration, remains an open direction for further research. The PAC-learning framework will also be used to analyze the performance of the (1+1) GP family of algorithms on Boolean functions in the next section.
The problems of evolving Boolean functions of arity , such as conjunctions (AND) or parity (XOR), have long been used as benchmarks in the field of GP [16, 19], and are well-understood in the PAC-learning framework [51] – conjunctions are evolvable efficiently, while parity problems are not. Unlike the problems considered in the previous sections, Boolean functions have a clear input/output behavior, allowing a natural definition of a fitness function related to program inputs and outputs, and can naturally support larger function and terminal sets.
A complexity analysis of the (1+1) GP algorithms on the AND and XOR problems has recently been presented [25]. Common to both problems, the GP algorithms are initialized with an empty tree, as larger trees are helpful for the easiest case of the AND problem.
For these problems, the fitness of the evolved solutions is evaluated by comparing their output to that of the target function on either the entire truth table, or a polynomial training subset. If an incomplete training set is used, the GP system may either choose it once at the beginning of the run (the static incomplete training set case as considered in [25]), or choose a fresh subset dynamically in every iteration (as in [21]). Both approaches may be valid in different practical settings: if the complete truth table is known but is prohibitively large, it may be sampled to produce an estimated fitness of a solution, while if only a limited number of input/output examples are available, some may need to be held back to validate the quality of the solution on inputs that it has not been trained on.
The AND problem, in its simplest form, tasks the GP with evolving a conjunction of all available input variables.
, .
The fitness of a tree using a training set selected from the rows of the complete truth table is the number of training set rows on which the value produced by evaluating the Boolean expression represented by the tree differs from the output of the conjunction of all inputs. This fitness value should be minimized; the optimal solution has a fitness of .
For example, when the complete truth table is used as the training set , the fitness of a tree containing only a single leaf on the AND problem with is , while the fitness of the optimum is (as the fitness function represents the error of the solution on the training set). In general, a conjunction of distinct variables has a fitness of on the complete training set. This fitness function is unimodal, making the AND problem somewhat similar to the OneMax benchmark problem for evolutionary algorithms: the GP system simply has to collect all distinct variables in its solution, with the fitness of the current solution improving with each distinct variable added.
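The following sketch evaluates this fitness function on the complete truth table for a small number of variables. It relies on the observation above that all interior nodes are conjunctions, so a tree is represented simply by the set of variable indices at its leaves; this representation, the treatment of the empty tree as the always-true function, and the names used are assumptions for illustration.

```python
from itertools import product


def conjunction(variables, row):
    """Value of the conjunction of the given variable indices on one row."""
    return all(row[i] for i in variables)


def and_fitness(variables, training_rows):
    """Number of training rows on which the tree disagrees with the target."""
    return sum(1 for row in training_rows
               if conjunction(variables, row) != all(row))


n = 4
complete_table = list(product([False, True], repeat=n))

# A conjunction of k distinct variables is true on 2^(n-k) rows and the
# target is true on exactly one of them, so the error is 2^(n-k) - 1.
for k in range(n + 1):
    print(k, and_fitness(set(range(k)), complete_table))
```

For n = 4 this prints errors of 15, 7, 3, 1 and 0 for k = 0 to 4: each newly collected variable roughly halves the remaining error, which is the OneMax-like behavior described above.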
Mambrini and Oliveto show that both the RLS-GP and the strictly elitist RLS-GP* algorithms can efficiently construct the optimal solution on the AND problem when using the complete truth table to evaluate solution fitness [25].
The expected optimization time of RLS-GP and RLS-GP on the AND problem using the complete training set is . The solution produced by RLS-GP contains exactly terminals.
The proof applies a coupon collector argument, showing that with probability , a new variable is added to the solution, and that no mutations decreasing the number of distinct variables are ever accepted. As all interior nodes are forced to be conjunctions, collecting all variables in the tree produces an optimal solution.
The following theorem presents a fixed budget analysis of the RLS-GP and RLS-GP algorithms, providing a relationship between the expected number of distinct variables in the solution and the time the algorithms are allowed to run.
Let denote the number of distinct variables in solution , and ( respectively) be the solution produced by the RLS-GP (RLS-GP) algorithm given a budget of iterations on the AND problem using the complete training set when initialized with an empty tree. Then,
The theorem is proven by following the techniques used to analyze Randomized Local Search on the OneMax problem in [13]. The exact expectation is known for RLS-GP, which never accepts solutions that do not improve fitness, and hence can never have a substitution sub-operation increase the number of distinct variables in the solution. The upper and lower bounds on for the RLS-GP stem from trivial bounds on the probability of a substitution sub-operation of HVL-Prime increasing the number of distinct variables in the solution. We note that although the relationship between the solution fitness () and the number of distinct variables it contains () is known, it is not possible to apply linearity of expectation to transform a bound on into a bound on (as could be done for OneMax).
The runtime analysis results have been extended to cover the algorithms, and show that the expected number of terminals in the constructed solution is .
The expected optimization time of the and the on the AND problem using the complete training set is . In expectation, the solution produced by the algorithms contains terminals.
On the AND problem, there are many possible trees which encode the desired behavior (as repeating a variable multiple times in the conjunction does not negatively affect the behavior of the program), and it is therefore possible that a “correct” program could contain many more than the required leaf nodes. The space complexity result in Theorem 25 shows that the considered GP systems construct a tree that in expectation contains just leaf nodes. It is proven by showing that the number of terminals containing variables present in the solution multiple times does not grow fast enough to affect the asymptotic size bound in the iterations required to collect all required variables with high probability.
The complete truth table for the AND problem contains 2^n rows in total, where n is the number of input variables. Hence, for all but small n, it is not feasible in practice to evaluate the exact fitness of a candidate solution.
If the training set is restricted to be polynomial in size and chosen uniformly at random from the complete truth table, then with high probability, a solution representing a conjunction of a logarithmic number of distinct variables will be correct on all of the inputs included in the training set, causing the optimization process to end prior to finding a solution that is correct on all possible inputs [25]. The following result holds both when the training set is sampled once and for all at the beginning of the run (i.e., a static training set), and when a new training set is sampled at each generation (i.e., a dynamic training set).
Let be the size of a training set chosen from the truth table uniformly at random with replacement. Then, both the RLS-GP and the RLS-GP will fit the training set on the AND problem in expected time ; with the solution containing at most variables.
This result is proven by observing that rows selected uniformly at random from the truth table are unlikely to assign more than input variables to true, and hence can be satisfied by inserting any one of a linear number of variables into the solution; after successful insertions, the probability that some row of the -row training set is still not satisfied is at most , and hence in expectation the process satisfies all rows after distinct variables are successfully inserted into the tree.
Theorem 26 also yields a lower bound on the generalization error of the solution: if it contains at most variables, the probability that its output is wrong on a truth table row sampled uniformly at random is , so that in expectation a polynomial number of samples taken uniformly at random from is required before a divergence from the target function is discovered.
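The counting behind this argument can be checked with exact arithmetic. A conjunction of k distinct variables out of n is true on 2^(n-k) rows of the truth table, and it disagrees with the target on all of those rows except the all-true row. The sketch below computes the resulting error on a uniformly random row, and the probability that a training set of s rows sampled with replacement contains no misclassified row; the function names are hypothetical.

```python
from fractions import Fraction


def generalization_error(n, k):
    """Probability that a conjunction of k of the n variables is wrong
    on a row drawn uniformly at random from the complete truth table."""
    return Fraction(2 ** (n - k) - 1, 2 ** n)


def prob_fits_training_set(n, k, s):
    """Probability that all s independently sampled rows are classified
    correctly by a conjunction of k distinct variables."""
    return (1 - generalization_error(n, k)) ** s


n = 50
for k in (1, 5, 10, 20):
    error = generalization_error(n, k)
    print(k, float(error), float(prob_fits_training_set(n, k, s=1000)))
```

The error is roughly halved with every additional distinct variable, so once k is logarithmic in the training set size, a sampled training set is very likely to contain no counterexample, even though the solution is far from the full conjunction.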
Theorem 26 has been extended to cover the and algorithms, using the Multiplicative Drift Theorem to provide a runtime bound on the expected time to fit the static polynomial-sized training set [21]. Additionally, a similar bound holds if instead of a static training set, each iteration samples independent rows of the complete truth table to compare the fitness of two solutions (using a dynamic training set).
Let rows from the complete truth table of the AND problem be sampled with replacement and uniformly at random in each iteration (where and are any constants). Then, the RLS-GP, RLS-GP*, (1+1) GP, and (1+1) GP* algorithms will construct a solution with a generalization error of at most in expected iterations. In expected iterations, the non-strictly elitist algorithms will construct a solution with a sampled error of .
Here, the training set size is chosen to be sufficiently large to ensure that solutions with a generalization error greater than are wrong on at least one training set row with high probability, preventing the GP system from terminating with a bad solution early, while the runtime bound stems from a random walk argument pessimistically considering the probabilities of accepting solutions increasing or decreasing the number of distinct variables in the tree being equal.
While the AND problem uses minimal function and terminal sets necessary to represent the optimal solution, both sets can be enlarged to represent a lack of knowledge regarding which components are actually necessary in order to solve a problem. These extensions lead to considerably more realistic applications of GP.
The AND problem is a variant of the AND problem in which the target function is a conjunction of distinct variables from the terminal set . This is similar to the setting considered by [51], and has been analyzed for the RLS-GP algorithms in [21], where the RLS-GP and RLS-GP (while disallowing the HVL-Prime substitution sub-operation) algorithms are able to construct the optimum solution on the AND problem in an expected iterations, while the canonical RLS-GP will with high probability fail to find the optimum.
The RLS-GP algorithm, and the RLS-GP algorithm (without the substitution HVL-Prime operation) find the optimum on the AND problem in expected iterations when using the complete training set.
The RLS-GP algorithm (with the substitution operation) will with high probability fail to find the optimum on the AND problem when for any constant when using the complete training set.
The analysis relies on showing that initially, inserting both variables that are present in the target function (“correct” variables), and those that are not (“incorrect” variables), is beneficial for the fitness value of the candidate solution, while removing incorrect variables only becomes beneficial after all correct variables are present in the current solution. With local search mutation and the substitution sub-operation of HVL-Prime, it is possible for the RLS-GP to accept a solution which substitutes the last copy of some incorrect variable with another copy of a still-present incorrect variable in the solution. If this occurs, RLS-GP will not be able to reach the global optimum, as no single application of HVL-Prime is capable of removing a leaf node containing an incorrect variable present multiple times in the current solution while improving fitness.
It is conjectured that a similar bound holds for the runtime of the and algorithms, which are able to introduce and remove duplicate terminals in the solution using larger mutation operations.
The function set could also be enlarged by introducing additional Boolean operators, such as OR or NOT, aiming to provide the GP with the expressive power necessary to represent any Boolean function. Mambrini and Oliveto have shown that if the unary NOT operation is introduced (by extending the set of literals with negated versions of each variable, avoiding the need to modify the HVL-Prime mutation operator to deal with non-binary functions), the RLS-GP algorithms are no longer able to efficiently construct the optimum solution on the AND problem using the complete training set [25]; this was extended by Lissovoi and Oliveto to cover the algorithms [21].
The RLS-GP, RLS-GP, and algorithms on the AND problem with do not construct an optimal solution in polynomial time with overwhelming probability when using the complete training set.
This result follows from the observation that a solution containing a conjunction of both a variable and its negation always evaluates to false, and hence has a nearly-optimal fitness value of (i.e., it is wrong on just one of possible inputs). Such a pair of literals is shown to be present in the current solution with overwhelming probability once it contains distinct literals. For the strictly elitist GP algorithms, reaching the global optimum would then require a large simultaneous mutation with an exponential waiting time, while the non-strictly elitist GPs need to essentially perform a random walk in dimensions and reach a particular point while receiving little guidance from the fitness function.
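The near-optimal fitness of contradictory solutions is easy to confirm on a small instance. The sketch below is an illustration under assumed conventions (a literal is encoded as a pair of a variable index and a negation flag, and the helper name is hypothetical); it counts the rows on which a conjunction of literals disagrees with the conjunction of all variables.

```python
from itertools import product


def errors_against_and(n, literals):
    """Count truth-table rows where a conjunction of literals differs from
    x_0 AND ... AND x_{n-1}. A literal is a pair (index, negated)."""
    wrong = 0
    for row in product([False, True], repeat=n):
        # A literal is satisfied when the variable's value differs from
        # its negation flag (positive literal: value must be True).
        candidate = all(row[i] != negated for i, negated in literals)
        wrong += candidate != all(row)
    return wrong


if __name__ == "__main__":
    n = 6
    print(errors_against_and(n, [(0, False)]))             # single variable
    print(errors_against_and(n, [(0, False), (0, True)]))  # x_0 AND NOT x_0
```

Any conjunction containing a variable together with its negation is constantly false, so it is wrong only on the all-true row, regardless of which other literals it contains.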
Additionally, even if the GP systems could be prevented from accepting any solution containing a contradiction (for instance, by weighing the all-true variable assignment much higher than any other input), the RLS-GP and algorithms still require exponential time to find the global optimum, as all non-optimal solutions containing all variables (in either the positive or negated form) share the same fitness value (, being wrong on the all-true input, and the single assignment satisfying the solution but not the target function), and the closer the GP system is to having all positive literals, the more likely mutation is to produce an offspring which replaces a positive literal with a negative one.
From a problem hardness perspective, it was shown that there exist small training sets of rows which allow the RLS-GP and algorithms to find exact solutions (with a generalization error of 0) to the AND, the AND, and the AND (with NOT) problems efficiently. In general, identifying such training sets may be non-trivial.
Let be an -row training set, where row sets to false and all (where ) to true and be a -row training set containing all the rows of and copies of the row setting all inputs to true. The RLS-GP and algorithms using the training sets ( respectively) are able to find the exact solution of AND and AND with , (AND with and ) in expected fitness evaluations (or training set row evaluations).
In the case of the NOT-extended AND problem, a variant of the which maintains and randomly selects from a population of individuals subject to a diversity mechanism prohibiting multiple solutions with identical outputs on the training set was proven to find an optimal solution in iterations on an row training set (consisting of all the inputs in and an input where all the variables are set to true) [21].
The XOR problem is that of evolving an exclusive disjunction of all input variables.
, .
The fitness of a tree using a training set selected from the rows of the complete truth table is the number of training set rows on which the value produced by evaluating the Boolean expression represented by the tree differs from the output of the exclusive disjunction of all inputs.
When using the complete truth table as the training set, the fitness of any non-optimal solution is , while the fitness of the optimal solution is . Thus, using the complete training set on XOR is similar to the Needle benchmark problem; Langdon and Poli note that “the fitness landscape is like a needle-in-a-haystack, so any adaptive search approach will have difficulties” [19], and the problem is known not to be evolvable in the PAC-learning framework [51].
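The needle-in-a-haystack structure can be reproduced by enumeration. The sketch below assumes that candidate trees are built only from binary XOR nodes over the input variables, so that the behavior of a tree is determined by the set of variables occurring an odd number of times; this encoding and the names used are illustrative assumptions rather than the chapter's notation.

```python
from itertools import combinations, product


def xor_errors(n, odd_vars):
    """Number of truth-table rows on which the XOR of the variables in
    odd_vars differs from the XOR of all n variables."""
    wrong = 0
    for row in product([0, 1], repeat=n):
        candidate = sum(row[i] for i in odd_vars) % 2
        target = sum(row) % 2
        wrong += candidate != target
    return wrong


n = 4
# Fitness of every possible behavior, indexed by the set of variables
# that appear an odd number of times in the tree.
fitness_values = {
    subset: xor_errors(n, subset)
    for size in range(n + 1)
    for subset in combinations(range(n), size)
}

if __name__ == "__main__":
    for subset, errors in fitness_values.items():
        print(subset, errors)
```

Every behavior other than the target is wrong on exactly half of the rows, so the fitness function gives no indication of how close a candidate is to the optimum.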
Predictably, the RLS-GP and algorithms are not able to optimize XOR efficiently. Strictly elitist variants of GP algorithms will not move from their initial solution unless the optimum is constructed directly, which is typically not possible for the RLS-GP, and occurs in expected exponential time for the , which needs to essentially construct the complete function in one mutation consisting of at least HVL-Prime sub-operations. When using the complete training set, the expected optimization time for the RLS-GP is exponential, as the algorithm accepts any and all mutations, while reaching the optimal solution requires all variables to appear an odd number of times in the solution [25].
The RLS-GP using and to evolve XOR using the complete truth table as training set requires more than iterations with probability to reach the optimum.
The theorem is proven by an application of the simplified negative drift theorem, showing that when the number of variables that appear in the current solution an odd number of times is large, there is a strong negative drift toward reducing this number, and the optimum requires all distinct variables to appear an odd number of times in the solution. The negative drift stems primarily from the HVL-Prime insertion operator: if a large number of variables is represented an odd number of times, it is more likely to insert one of these variables when choosing a terminal uniformly at random.
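The direction of the drift caused by insertions can be made concrete with a small calculation. The sketch below is a simplification that models only the insertion sub-operation with a terminal chosen uniformly at random, ignoring deletions and substitutions; the function name is hypothetical.

```python
from fractions import Fraction


def insertion_drift(n: int, k: int) -> Fraction:
    """Expected change in the number of variables appearing an odd number
    of times after inserting one uniformly random terminal, when k of the
    n variables currently appear an odd number of times."""
    p_down = Fraction(k, n)    # inserted variable was odd, becomes even
    p_up = Fraction(n - k, n)  # inserted variable was even, becomes odd
    return p_up - p_down


if __name__ == "__main__":
    n = 10
    for k in (2, 5, 8, 9):
        print(k, insertion_drift(n, k))
```

Under this simplified model the expected change is negative as soon as more than half of the variables appear an odd number of times, and it becomes more strongly negative the closer the solution gets to the optimum.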
While sampling solution fitness using a polynomial number of complete training set rows is also possible on XOR, the outcome is generally underwhelming: if only a logarithmically small number of training set rows are sampled in each iteration, the algorithm will in expected polynomial time terminate with a non-optimal solution that fits the sampled training set, while using training sets of super-logarithmic size will lead to super-polynomial optimization time. Thus, in any polynomial amount of time, the expected generalization ability of the considered GP systems on XOR is , i.e., requiring in expectation a constant number of samples taken uniformly at random from before a divergence from the target function is discovered.
There is also a straightforward extension of Theorem 32 to dynamic training sets of polynomial size, as such sampling provides no consistent indication of fitness.
The RLS-GP and algorithms sampling rows of the complete truth table in each iteration on XOR with and with high probability do not reach the optimum in polynomial time.
The RLS-GP and algorithms will accept any non-optimal offspring of a non-optimal parent with probability at least , as both the offspring and the parent are wrong on inputs, and there are exactly as many rows on which the offspring is correct while the parent is wrong as the converse, and the offspring is accepted in cases of tied fitness.
With rows sampled uniformly at random in each iteration, the probability that a non-optimal solution is correct on all sampled rows is , and by a straightforward union bound, the GP algorithms do not terminate within polynomial time unless the optimal solution is found.
With the exception of any iterations in which the offspring individual is rejected, the algorithms behave identically to the RLS-GP and algorithms using the complete truth table to evaluate solution fitness (i.e., accepting offspring regardless of the effects of mutation), and thus cannot achieve better performance than these algorithms in terms of the number of iterations performed.
Theorem 32 only provides a runtime bound for the RLS-GP. A similar result for the can be obtained by observing that performs in expectation two HVL-Prime sub-operations in each iteration, and hence even if the algorithm terminated immediately upon constructing the optimal solution (even if this occurred in the middle of a mutation), it would in expectation be only a constant factor faster than RLS-GP in terms of the number of iterations required to find the optimum. ∎
The previous sections covered the available theoretical results for standard tree-based GP systems, which constitute the majority of GP theoretical complexity analysis results. In this section, we present a slightly different approach to GP system design, which aims to evolve programs semantically rather than syntactically.
Standard tree-based GP evolves programs by applying mutation and crossover to their syntax. Programs that are considerably different syntactically may produce identical output, while minimal syntactic mutations may completely change the output of a program. Moraglio et al. [30] introduced Geometric Semantic GP (GSGP) with the aim of focusing GP search on program behavior. In particular, GSGP mutation and crossover operators modify programs in a way that allows the GP system to search through the semantic neighborhood (which consists of programs with similar behavior) rather than their syntactic neighborhood (which consists of programs with similar syntax).
GSGP generally uses a natural program representation for the domain at hand (e.g., representing programs using Boolean expressions when a Boolean expression is to be evolved), and uses specialized semantic mutation and crossover operators to produce offspring programs with behavior similar to that of their parents. These operators generally reproduce the parent programs in their entirety, adding to them to modify their behavior in a limited fashion. For example, the GSGP mutation operator might produce an offspring which contains an exact copy of its parent and a random element which overrides some portions of the parent’s behavior, while the GSGP crossover operator could construct an offspring containing exact copies of both parents and a random element which switches between the two behaviors depending on the inputs. As both operators increase the size of the programs by adding additional syntax to the parent programs to encode the chosen random components (and the crossover includes exact copies of both parents), the programs produced by these operators need to be simplified in order for the algorithm to remain tractable. For some domains, like Boolean functions, quick function-preserving simplifiers exist, while computer algebra systems and static analysis can be used to simplify more complex expressions and programs [30].
Initial experimental results suggest that GSGP consistently finds solutions that fit the training sets used for a wide array of simple Boolean benchmark functions, regression problems for polynomials of degree up to 10, and various classification problems, outperforming standard tree-based GP with the same evaluation budget [30]. Semantic geometric crossover and mutation operators have been designed for many problem domains, including regression problems [31], learning classification trees [24], and Boolean functions [32]. In these papers, theoretical guarantees are derived regarding the number of generations it takes GSGP to construct a solution fitting the training set (or achieving an -small training set error in the case of regression problems). In this section, we explore the theoretical results focusing on the latter setting: applying geometric semantic search to evolving Boolean functions.
In the case of Boolean functions, the program semantics can be represented by the -row output vector, corresponding to the program output on all rows of the complete -variable truth table. In this setting, the semantic crossover operator SGXB, acting on two parents and , produces an offspring solution , where is a randomly-generated Boolean function. This offspring outputs the solution produced by if evaluates to true, and the solution produced by if evaluates to false, effectively performing crossover on the -row output vectors of the two parent solutions. The semantic mutation operator SGMB, acting on a single parent , produces the offspring with probability 0.5, and with probability 0.5, where is a random minterm (a conjunction where each variable appears either in positive or negated form) of all input variables. This effectively copies the output vector of , setting the rows on which evaluates to true to either true or false.
These operators allow GSGP to always observe a cone fitness landscape on any Boolean function, i.e., the mutation operator is always able to improve the behavior of the parent program. This allows mutation-only GSGP to hill-climb its way to the optimal program for any function in this domain. However, as the output vector contains rows, hill-climbing by applying SGMB, which only affects one row per iteration, would take iterations (by the coupon collector argument, or similarly to Randomized Local Search on a -bit OneMax function).
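The effect of the two semantic operators on output vectors can be illustrated with programs represented as Python callables. This is a minimal sketch under stated assumptions: the names sgxb, sgmb, random_minterm and output_vector are illustrative, programs are modelled as functions from a tuple of Booleans to a Boolean, and the offspring formulas are written in the form described above (crossover selects between the parents according to a third function, mutation forces the row matched by a random minterm to true or to false).

```python
import random
from itertools import product


def sgxb(t1, t2, r):
    """Crossover sketch: behave like t1 where r is true, like t2 elsewhere."""
    return lambda row: (t1(row) and r(row)) or ((not r(row)) and t2(row))


def random_minterm(n, rng):
    """A minterm over all n variables is true on exactly one row."""
    pattern = tuple(rng.random() < 0.5 for _ in range(n))
    return lambda row: tuple(row) == pattern


def sgmb(t, m, rng):
    """Mutation sketch: force the row matched by m to true or to false,
    each with probability 0.5, leaving all other rows unchanged."""
    if rng.random() < 0.5:
        return lambda row: t(row) or m(row)
    return lambda row: t(row) and not m(row)


def output_vector(f, n):
    """Semantics of f: its output on every row of the truth table."""
    return [bool(f(row)) for row in product([False, True], repeat=n)]


if __name__ == "__main__":
    rng = random.Random(0)
    n = 3
    parent = lambda row: row[0] and row[1]
    child = sgmb(parent, random_minterm(n, rng), rng)
    print(output_vector(parent, n))
    print(output_vector(child, n))
```

In this sketch a mutated offspring differs from its parent on at most one row of the output vector, which is what makes the search behave like Randomized Local Search on a bit string whose length equals the number of truth table rows.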
For GSGP on any Boolean function, a polynomially-sized training set can be viewed as a OneMax problem on a -bit string where only a polynomial number of bits are non-neutral (i.e., contribute to the solution’s fitness). In that setting, the runtime can be improved by allowing mutations to flip more than one bit of the output vector per iteration (e.g. such that in expectation one non-neutral bit is affected per iteration). This setting is explored in [32], with various approaches to the design of mutation operators, establishing a hierarchy of operator expressiveness (based on how much of the search space they enable the GP system to explore), and considering the probability of fitting a training set of polynomial size. Their results show that while the Varying Block Mutation (VBM) operator, which in each iteration draws an incomplete minterm of variables chosen uniformly at random (where is a parameter), is more expressive than Fixed Block Mutation (FBM), which picks the variables once during the run, or Fixed Alternative Block Mutation (FABM), which partitions the variables into sets, and forms the minterm by picking a variable from each set uniformly at random in each iteration, there nevertheless exist training sets which GSGP using VBM cannot fit in any amount of time. Conversely, they also prove that the less-expressive FBM operator can with high probability fit a training set of polynomial size sampled uniformly at random from the complete truth table of any Boolean function.
Let a training set consist of rows, with a positive constant, sampled uniformly at random from any problem . Then, GSGP using the Fixed Block Mutation (FBM) operator with is able to fit with probability at least (for any ); conditioning on this, a function fitting the training set is found in an expected iterations.
This result is proven by observing that FBM’s initial choice of variables (to use as the basis for the minterms) partitions the row output vector of into blocks of equal size, each corresponding to a particular minterm of the variables. Choosing partitions the output vector into more than blocks, ensuring that with high probability all training set rows (chosen uniformly at random from the complete truth table) are in different blocks, and thus the training set can be satisfied by collecting the exact minterms corresponding to the blocks which contain the training set rows. When this condition holds, the expected runtime is obtained by a Coupon Collector argument.
Of course, if FBM chooses the variables poorly with respect to the training set (meaning that at least two training set rows demanding different output are contained in the same block), GSGP will not be able to fit the training set. More expressive operators such as FABM or VBM can minimize this probability at the cost of a mild runtime penalty by allowing the mutation operator to be more flexible when choosing which variables to use as the basis for the minterm (e.g. increasing the runtime by a factor of , but improving the success probability from to where is the number of classes in the partition created by FABM).
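The condition under which a fixed choice of variables allows the training set to be fitted can be stated as a simple check. The sketch below is an illustration, with hypothetical names and an assumed encoding of the training set as pairs of an input row and a target output; it groups rows into blocks according to their values on the chosen variables and reports a failure when one block contains rows with conflicting targets.

```python
def block_of(row, chosen_vars):
    """Block index of a row: its values on the variables fixed by FBM."""
    return tuple(row[i] for i in chosen_vars)


def fbm_can_fit(training_set, chosen_vars):
    """training_set is a list of (row, target_output) pairs. Fitting is
    possible only if no block holds rows demanding different outputs."""
    required = {}
    for row, target in training_set:
        block = block_of(row, chosen_vars)
        # Record the first target seen for this block and compare later
        # rows from the same block against it.
        if required.setdefault(block, target) != target:
            return False
    return True


if __name__ == "__main__":
    training = [
        ((True, False, True), True),
        ((True, True, True), False),
    ]
    print(fbm_can_fit(training, [0, 2]))  # both rows share a block
    print(fbm_can_fit(training, [1]))     # rows fall into different blocks
```

Choosing more variables produces more and smaller blocks, which lowers the chance of such a conflict for a randomly sampled training set.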
There are also modifications of the GSGP mutation operators that are able to cover the entire search space of programs, eliminating the possibility of failure. There exist classes of Boolean functions on which such operators are effective, allowing the GSGP to fit any training set in expected polynomial time, as shown in the following theorem.
Let be a disjunctive normal form (DNF) formula with conjunctions, every conjunction containing at most variables. Then can be obtained by GSGP with Multiple Size Block Mutation (MSBM) in expected iterations, i.e., polynomial time.
The MSBM mutation operator is a modification of the VBM variant of the SGMB operator. It samples an integer between and , selects variables from the set of input variables, and then generates uniformly at random an incomplete minterm of these variables. This modified mutation operator essentially allows each clause of the target function to be “fixed” in the current solution in expected polynomial number of iterations.
At present, there is no theoretical analysis of how the functions produced by GSGP generalize to unseen inputs. The issue has been considered experimentally [43, 42, 29], with results suggesting that while the initially proposed geometric semantic crossover and mutation operators often achieve poor generalization despite good training set performance, other variants of the semantic operators and algorithm components may be able to achieve better generalization performance.
We have presented an overview of the available results in the computational complexity analysis of GP algorithms. The results follow the blueprint suggested by Poli et al., starting with the analysis of simple GP systems based on mutation and stochastic hill-climbing on simple problems [45]. The complexity of the problems has slowly increased, from the analysis focusing on the main characteristic difficulties of GP (i.e., variable solution length, and solution quality evaluations) to more recent results considering the evolution of functions with true input/output behavior and using realistically constrained fitness functions. The approach of gradually expanding the complexity of analyzed systems was also endorsed by Goldberg and O’Reilly, who stated that the “methodology of using deliberately designed problems, isolating specific properties, and pursuing, in detail, their relationships in simple GP is more than sound; it is the only practical means of systematically extending GP understanding and design” [11].
The GP systems considered for theoretical analysis have remained relatively simple: HVL-Prime mutation with limited, if any, populations and no crossover is a common setting. In many cases, the analysis for the positive runtime results is only made tractable because “the fitness structure of the model problems is simple, and the algorithms use only a simple hierarchical variable length mutation operator” [7]. In particular, variable length representations and bloat often complicate the analysis of GP systems, and require “rather deep insights into the optimization process and the growth of the GP-trees” [5].
For GP systems utilizing geometric semantic mutation and crossover operators, analyses of the time required to produce a solution fitting the training set are available for wider classes of functions, and frequently do not require insight into the structure of the considered function. However, theoretical analyses of how well the GSGP solutions generalize – how well they perform on inputs not included in the training set – remain a challenge.
While the presented results represent first steps in rigorously analyzing the behavior of GP systems, bridging the gap to the GP systems used in practice requires analyzing more complex GP algorithms on more realistic problems. Thus, extending the presented results to broader classes of problems (for instance, those allowing more flexibility in program behavior), to other problem classes on which GP experimentally performs well (such as symbolic regression), and to more realistic GP algorithms (introducing populations and crossover) are the main directions for further research.
Acknowledgements Financial support by the Engineering and Physical Sciences Research Council (EPSRC Grant No. EP/M004252/1) is gratefully acknowledged.
Gathercole, C., Ross, P.: Dynamic training subset selection for supervised learning in genetic programming. In: Proceedings of the 3rd International Conference on Parallel Problem Solving from Nature (PPSN 1994), pp. 312–321 (1994)
Jansen, T., Zarges, C.: Performance analysis of randomised search heuristics operating with a fixed budget. Theoretical Computer Science 545, 39–58 (2014)