To solve large and complex problems, scalability is among the primary concerns of an optimization practitioner. However, only a few studies [18, 19] address scalability in genetic programming (GP). The same holds for simple approaches to using probabilistic recombination in GP within the estimation of distribution algorithm (EDA) framework [12, 9, 14], such as probabilistic incremental program evolution (PIPE).
The purpose of this paper is to study the scalability of standard GP and PIPE on two decomposable GP problems: ORDER and TRAP. The two algorithms perform as expected: they solve ORDER scalably but fail to scale up on TRAP. Additionally, the paper studies the effects of introducing unnecessary and irrelevant primitives. Both GP and PIPE are shown to deal with these two sources of difficulty well. The results presented in this paper confirm that binary-string GAs have a lot in common with GP and PIPE, and thus the lessons learned in the design, study, and application of standard GAs and their extensions should carry over to GP, as argued for example in [6, 18, 19].
The paper starts by describing the two algorithms investigated: GP and PIPE. Section 3 explains the test problems. Section 4 provides and discusses experimental results. Section 5 presents important topics for future work in this line of research. Section 6 summarizes the paper. Finally, Section 7 concludes the paper.
2 GP and PIPE
Both GP and PIPE work with programs encoded as labeled-tree structures and both can be applied to the same class of problems. While GP generates new candidate programs using standard variation operators, such as crossover and mutation, PIPE builds and samples a probabilistic model in the form of a tree of mutually independent nodes. Therefore, the difference between GP and PIPE is in their variation operator (see Figure 1).
This section describes GP and PIPE. The section starts by discussing standard GP and closes by describing the probabilistic algorithm PIPE.
2.1 Genetic Programming
Genetic programming (GP) is a genetic algorithm (GA) that evolves programs instead of fixed-length strings. Programs are represented by trees where internal nodes represent functions and leaves represent variables and constants.
GP starts with a population of random candidate programs. Each program is evaluated on a given task and assigned a fitness value. A population of promising programs is then selected using one of the standard GA selection operators, such as tournament or truncation selection. Some of the selected programs can be directly copied into the new population; the remaining ones are copied after applying variation operators, such as crossover and mutation. Crossover usually proceeds by exchanging randomly selected subtrees between two programs, whereas mutation usually replaces a randomly selected subtree of a program with a randomly generated one. This process is repeated until the termination criteria are met.
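As a concrete illustration of the variation operators just described, the following sketch implements subtree crossover on programs represented as nested Python lists. This encoding is chosen purely for illustration; lilgp uses its own internal representation.

```python
import random

def subtree_positions(tree, path=()):
    """Enumerate paths (tuples of child indices) to all subtrees."""
    yield path
    if isinstance(tree, list):                 # internal node: [function, arg1, arg2, ...]
        for i, child in enumerate(tree[1:], start=1):
            yield from subtree_positions(child, path + (i,))

def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def replace_subtree(tree, path, new):
    if not path:
        return new
    copy = list(tree)
    copy[path[0]] = replace_subtree(copy[path[0]], path[1:], new)
    return copy

def crossover(parent_a, parent_b, rng=random):
    """Swap a randomly chosen subtree of parent_a with one of parent_b."""
    pa = rng.choice(list(subtree_positions(parent_a)))
    pb = rng.choice(list(subtree_positions(parent_b)))
    child_a = replace_subtree(parent_a, pa, get_subtree(parent_b, pb))
    child_b = replace_subtree(parent_b, pb, get_subtree(parent_a, pa))
    return child_a, child_b
```

Mutation can be sketched the same way: instead of taking the replacement subtree from the second parent, it is generated at random.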
Since standard GP variation operators proceed without considering interactions between different components of selected programs, they are likely to experience difficulties on problems where different program components interact strongly. However, problems that can be decomposed into subproblems of order one should be easy for any standard GP based on recombination. This intuition is verified with experiments in Section 4. Similar behavior can be observed in GAs: GAs with standard variation operators perform well on problems with no interactions between decision variables [13, 7, 5], but they often fail on problems with strongly interacting decision variables [20, 5].
We implemented GP using the lilgp GP library developed by the Genetic Algorithms Research and Applications Group (GARAGe) at Michigan State University.
2.2 Probabilistic Incremental Program Evolution
In the probabilistic incremental program evolution (PIPE) algorithm [16, 17], computer programs or mathematical expressions are evolved as in GP. However, pairwise crossover and mutation are replaced by building a probabilistic model of promising programs and sampling that model.
Like GP, PIPE represents programs by labeled trees where each internal node represents a function and each leaf represents a variable or a constant. The initial population is also generated at random. All programs in the population are then evaluated and selection is applied to select the population of promising programs. Instead of applying crossover and mutation to a part of the selected population to generate new programs, PIPE now builds a probabilistic model of the selected programs in the form of a tree. This probabilistic model is then sampled to generate new candidate programs that form the new population. The process is repeated until the termination criteria are met.
Next, the methods for learning and sampling the probabilistic model in PIPE are described.
2.2.1 Learning the Probabilistic Model
The probabilistic model in PIPE is a tree with the structure corresponding to the structure of candidate programs. Since different programs may be of different structure and size, the population is first parsed to find the smallest tree that contains every structure in the selected population. Each node of a program in the selected population then directly corresponds to one node in the model, whereas the children of each internal node represent arguments of the function in this node. Figure 2 illustrates probabilistic models used in PIPE.
If there are functions of different arities, the number of children of each node in the probabilistic model is equal to the maximum arity of a function occurring in this node in the selected population. For a function of smaller arity, only the first children, as many as the function's arity, are interpreted as its arguments (in an arbitrary but fixed ordering).
PIPE then parses the selected population and computes the probabilities of different functions and terminals in each node of the probabilistic model. The nodes of the probabilistic model thus consist of tables of probabilities, and there is one probability for each function or terminal in each node.
2.2.2 Sampling the Probabilistic Model
Sampling of the probabilistic model starts at the root of the probabilistic model. The same recursive procedure is used to generate each node. First, a function or terminal is generated in the current node based on the distribution encoded by the table of probabilities in this node. If the function requires several arguments, the necessary number of children is generated recursively. The recursive generation terminates in a node whenever a terminal is generated in this node and thus no children have to be generated. Since the probabilistic model is built from an actual population of programs, the sampling will never cross the boundaries of the model.
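A minimal sketch of this recursive sampling procedure follows. The ModelNode class and the ARITY table are illustrative assumptions for this sketch, not PIPE's actual data structures.

```python
import random

# Illustrative arity table: JOIN is the only function (binary); any symbol
# not listed here is treated as a terminal. These names are assumptions.
ARITY = {'JOIN': 2}

class ModelNode:
    """One node of the probabilistic model: a table of probabilities over
    functions/terminals plus one child model node per argument position."""
    def __init__(self, probs, children=()):
        self.probs = probs            # e.g. {'JOIN': 0.6, 'X1': 0.4}
        self.children = list(children)

def sample_program(node, rng=random):
    """Recursively sample one candidate program from the model."""
    symbols, weights = zip(*node.probs.items())
    symbol = rng.choices(symbols, weights=weights)[0]
    arity = ARITY.get(symbol, 0)      # terminals have arity 0
    if arity == 0:
        return symbol                 # recursion terminates at terminals
    return [symbol] + [sample_program(node.children[i], rng) for i in range(arity)]
```

Sampling a node whose table puts all probability on a terminal stops immediately; sampling a function descends into as many child model nodes as the function has arguments.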
Using the probabilistic model of PIPE to model and sample candidate programs resembles the univariate marginal distribution algorithm (UMDA) [12, 1], which models each string position independently of the values in other positions. Interactions between each node and its context are ignored. That is why it can be expected that using this model will lead to inferior results on problems where program components interact strongly, similarly as the univariate model generally fails when string positions interact. On the other hand, if different program components are mutually independent, PIPE should perform well. This intuition is verified with experiments in Section 4.
We implemented PIPE by incorporating probabilistic recombination into the lilgp library developed by GARAGe at Michigan State University.
3 Test problems
In order to test scalability, we need a class of problems where size can be modified while the inherent problem difficulty does not grow prohibitively fast. In fixed-length string GAs, decomposable problems of bounded difficulty can be used as a challenging but solvable class of problems. Two types of decomposable problems for fixed-length string GAs are common: onemax and concatenated traps. In onemax, the contribution of each bit is independent of its context. On the other hand, in concatenated traps, bits in each trap partition interact and cannot be effectively processed without considering other bits in the same trap partition.
ORDER: a onemax-like, GP-easy problem.
TRAP: a deceptive-trap-like, GP-difficult problem.
ORDER should be easy for any recombination-based GP. However, since standard variation operators do not consider interactions between different program components, TRAP can be expected to lead to exponential scalability of both standard GP and PIPE. The problems are described next.
3.1 Problem 1: Order
The primitive set of an n-primitive ORDER problem consists of a binary function JOIN and complementary terminals X_i and X̄_i for i = 1, ..., n. A candidate solution of the ORDER problem is a binary tree with JOIN in all internal nodes and either X_i's or X̄_i's at its leaves. The candidate solution's output is determined by parsing the program tree inorder (from left to right). The program expresses X_i if, during the inorder parse, X_i is encountered before its complement X̄_i and neither X_i nor its complement has been encountered earlier. For all i, if X_i is unexpressed, X̄_i is expressed instead. One terminal is thus expressed from each pair X_i and X̄_i.
For all i, an equal unit of fitness value is accredited if X_i is expressed:

    f(X_i) = 1,    f(X̄_i) = 0.

The fitness function for ORDER is defined as

    F_ORDER = sum of f(x) over all x in E,

where E is the set of primitives expressed by the program. Given that trees can be sufficiently large, the expression for a globally optimal solution of an n-primitive ORDER problem is (X_1, X_2, ..., X_n) and thus its fitness value is n.
For example, consider the candidate solution for a 4-primitive ORDER problem shown in Figure 3; its fitness is obtained by parsing the leaves inorder, determining the expressed primitives, and counting the expressed X_i's.
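The expression mechanism and fitness of ORDER can be sketched as follows; here X_i is encoded as the integer i and X̄_i as -i, an encoding chosen purely for this illustration:

```python
def inorder_leaves(tree):
    """Collect the leaves of a nested-list program tree from left to right."""
    if not isinstance(tree, list):
        return [tree]
    return [leaf for child in tree[1:] for leaf in inorder_leaves(child)]

def express(leaves):
    """From each complementary pair, express the terminal encountered first."""
    expressed = {}
    for leaf in leaves:
        expressed.setdefault(abs(leaf), leaf)   # later occurrences are ignored
    return set(expressed.values())

def order_fitness(tree):
    """Each expressed positive terminal X_i contributes one unit of fitness."""
    return sum(1 for x in express(inorder_leaves(tree)) if x > 0)

# Hypothetical 4-primitive example: leaves visited inorder are [1, -2, 3, 2];
# the expressed set is {X1, X̄2, X3}, so only X1 and X3 contribute fitness.
program = ['JOIN', ['JOIN', 1, -2], ['JOIN', 3, 2]]
```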
3.2 Problem 2: Deceptive Trap
In standard GAs, deceptive functions [3, 5] are designed to thwart the very mechanism of selectorecombinative search by punishing any localized hillclimbing and requiring mixing of whole building blocks at or above the order of deception. Using such adversarially designed functions is a stiff test (in some sense the stiffest test) of algorithm performance. The idea is that if an algorithm can beat an adversarially designed test function, it can solve other problems that are equally hard or easier than the adversary. Furthermore, if the building blocks of such deceptive functions are not identified and respected by selectorecombinative GAs, the search almost always converges to a local optimum.
TRAP is designed to test the same mechanisms in GP. Fitness is computed so that if interactions between different components of the program are not considered, optimization may be misled away from the global optimum. Similarly as with standard GAs on deceptive functions, standard GP is expected to fail in solving TRAP scalably, indicating the need for linkage learning in GP.
Programs in TRAP also consist of one binary function JOIN and n pairs of complementary primitives X_i and X̄_i. The expression mechanism of the program for TRAP is identical to that of ORDER. The difference is in the fitness evaluation procedure.
In TRAP, the expressed set of primitives is first mapped to an n-bit binary string. The i-th bit of the string is 1 if and only if X_i was expressed; otherwise, the i-th bit of the string is 0. The resulting binary string is then partitioned into groups of k bits each (the partitioning is fixed during the entire run) and a trap function is applied to each group:

    trap(u) = f_high                        if u = k,
    trap(u) = f_low - u * f_low / (k - 1)   otherwise,

where u is the number of ones in the input string of k bits.
The fitness of TRAP is then computed by adding the contributions of all groups of k bits together.
The difficulty of the trap can be adjusted by modifying the values of k, f_high, and f_low. The problem becomes more difficult as the value of f_low is increased and that of f_high is decreased. A k-bit deceptive trap function is illustrated in Figure 4. In this paper we use traps with a fixed trap size and fixed values of f_high and f_low.
The important feature of additively separable trap functions is that, when looking at the performance of any subset of bits corresponding to one trap, it seems better to propagate 0s; here this means eliminating X_i's and substituting X̄_i's or nothing. If interactions between different components of the program are not considered, GP can thus be expected to scale up poorly on this problem.
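Under the definition above, the trap contribution and the full TRAP fitness can be sketched as follows. The default values of f_high and f_low are placeholders for illustration, not the values used in the experiments.

```python
def trap(u, k, f_high=1.0, f_low=0.9):
    """Deceptive trap: global optimum at u = k, deceptive attractor at u = 0."""
    if u == k:
        return f_high
    return f_low - u * f_low / (k - 1)

def trap_fitness(bits, k, f_high=1.0, f_low=0.9):
    """Sum trap contributions over fixed consecutive groups of k bits."""
    assert len(bits) % k == 0
    return sum(trap(sum(bits[i:i + k]), k, f_high, f_low)
               for i in range(0, len(bits), k))
```

Between u = 0 and u = k - 1 the contribution strictly decreases, so any search that looks at fewer than k bits of a trap at a time is pulled toward the all-zeros attractor.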
3.3 Other primitives
In addition to ORDER and TRAP with JOIN and terminal pairs, we tested GP and PIPE on ORDER with two additional primitives: a negative-join function and junk (unexpressed) terminals. The purpose of these additional tests was to determine how GP and PIPE respond to more complex interactions and unnecessary program primitives.
3.3.1 Primitive negative-join
NEG_JOIN affects all its descendant terminals by expressing each X_i primitive as its negation X̄_i; analogously, all X̄_i descendants are expressed as X_i. If a terminal has more than one NEG_JOIN ancestor, only one of them is considered and the terminal is negated only once.
NEG_JOIN is unnecessary for solving ORDER and it does not introduce a less complex or easier-to-find global optimum. Furthermore, NEG_JOIN introduces interactions into ORDER because the best value in each leaf depends on its ancestors. Nonetheless, these interactions are relatively simple, as many leaves are expected to have a NEG_JOIN on the path to the root.
For example, for the program shown in Figure 5, the inorder pass through the program yields a sequence of leaves; the expression, and hence the fitness of the program, is then determined after negating every leaf with a NEG_JOIN ancestor.
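The negate-once semantics can be sketched by threading a flag through the inorder traversal (same +i/-i terminal encoding as before, chosen for illustration):

```python
def inorder_leaves_neg(tree, negated=False):
    """Inorder leaves where a leaf with at least one NEG_JOIN ancestor
    is negated exactly once, no matter how many such ancestors it has."""
    if not isinstance(tree, list):
        return [-tree if negated else tree]
    negated = negated or tree[0] == 'NEG_JOIN'   # flag sticks; no double negation
    return [leaf for child in tree[1:] for leaf in inorder_leaves_neg(child, negated)]
```

For instance, the hypothetical program ['JOIN', ['NEG_JOIN', 1, -2], 3] yields the leaf sequence [-1, 2, 3], which can then be fed to the ORDER expression and fitness procedure unchanged.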
3.3.2 Junk-code terminals
Junk-code or JUNK terminals represent unnecessary primitives that are irrelevant for the particular problem. In biological terms, JUNK terminals correspond to junk code in DNA. During the expression phase, JUNK terminals are simply ignored and they thus do not influence the overall fitness at all.
Adding JUNK terminals makes the optimization problem more difficult, because additional primitives enlarge the search space without simplifying the problem. The influence of JUNK terminals can be tuned by changing the number of unique JUNK terminals.
Figure 6 shows a tree with two JUNK terminals. The inorder parse skips the JUNK leaves, and the expression and fitness of the solution are computed from the remaining leaves as in basic ORDER.
4 Experiments
This section compares the performance of GP and PIPE on three variants of ORDER and one variant of TRAP.
4.1 Description of experiments
The scalability of GP and PIPE was tested on four classes of problems:
1. basic ORDER (no JUNK or NEG_JOIN),
2. basic TRAP (no JUNK or NEG_JOIN),
3. ORDER with NEG_JOIN, and
4. ORDER with JUNK terminals, where the number of unique JUNK terminals grows with the problem size.
The scalability experiments were performed by testing both algorithms on problem instances with an increasing number of primitives.
Additionally, the effects of increasing the number of unnecessary primitives on the performance of GP and PIPE were studied by testing both algorithms on an ORDER instance of fixed size with an increasing number of JUNK terminals.
Binary tournament selection was used in both GP and PIPE. Crossover was applied in GP with a fixed probability; to focus on the effects of recombination, no mutation was used. The initial population in both methods was generated using the standard half-and-half method. The maximum tree depth was set to one more than the depth of the smallest tree that can store the global optimum. For each problem instance, the minimum population size required to solve 30 independent runs was estimated using a bisection method, and a population size within a small tolerance of this minimum was used. The runs were terminated when the algorithms found the global optimum or when the number of generations became too large for the particular problem.
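The population-sizing procedure can be sketched as follows; the success predicate, the starting size, and the 10% stopping tolerance are illustrative assumptions, not the exact settings used in the experiments.

```python
def min_population_size(success, low=16):
    """Approximate the smallest population size for which success(pop_size)
    holds (e.g., all of 30 independent runs find the global optimum).
    A doubling phase finds an upper bound; bisection then narrows it down."""
    high = low
    while not success(high):                 # double until a sufficient size is found
        high *= 2
    low = high // 2                          # last insufficient size (or below)
    while high - low > max(1, low // 10):    # stop within ~10% (illustrative)
        mid = (low + high) // 2
        if success(mid):
            high = mid
        else:
            low = mid
    return high
```

In practice, success(p) would run the GP or PIPE configuration 30 times with population size p and report whether every run reached the global optimum, so each probe of the predicate is expensive and the tolerance trades precision for computation.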
Figure 7 shows the scalability of GP and PIPE on ORDER without NEG_JOIN or JUNK terminals. Problem instances of a range of sizes were examined. The figure shows the average number of function evaluations over 30 successful runs with respect to the problem size (the number of positive literals). The results indicate that PIPE is slightly more efficient than GP, but both GP and PIPE scale up with a low-order polynomial. These results are in agreement with the behavior observed in binary-string GAs on the simple onemax problem. On onemax, both the simple GA and UMDA find the optimum in low-order polynomial time [13, 7, 5, 15]; however, UMDA performs slightly better because it uses a more effective recombination for this type of problem.
Figure 8 compares the scalability of GP and PIPE on TRAP without NEG_JOIN or JUNK terminals. The size of one trap and the signal difference between f_high and f_low were fixed across instances. Problem instances of a range of sizes were examined. On TRAP, GP performs slightly better than PIPE. This can be explained by GP's weaker recombination operator: here recombination disrupts important partial solutions, as can be hypothesized based on the performance of standard GAs on similar problems. Nonetheless, both GP and PIPE scale up poorly and indicate an exponential growth of the number of function evaluations with problem size.
Figure 9 compares the scalability of GP and PIPE on ORDER with NEG_JOIN. Problem instances of a range of sizes were examined. Both GP and PIPE perform similarly as on basic ORDER without NEG_JOIN, but there is a slight decrease in their performance because of the interactions introduced by NEG_JOIN.
Figure 10 compares the scalability of GP and PIPE on ORDER with JUNK terminals, where the number of unique JUNK terminals grows with the number of positive terminals. Both GP and PIPE seem capable of dealing with these irrelevant terminals and achieve performance comparable to that on basic ORDER.
The last two sets of experiments are similar in that they show how the performance of GP and PIPE changes when irrelevant terminals are added to the representation. An ORDER instance of fixed size is used with an increasing number of JUNK terminals. The two experiments differ in the bound on the maximum tree depth: Figure 11 shows the results with a smaller maximum depth, whereas Figure 12 shows the results with a larger one. The problem with the smaller maximum depth is more difficult for both GP and PIPE because JUNK terminals obstruct the creation of an optimal solution, which only barely fits within the maximum allowed tree. PIPE deals with this “lack of space” better than GP does. However, in both cases, the number of evaluations still appears to grow with a low-order polynomial or slower as irrelevant terminals are added.
5 Future Work
Future work should study the scalability of GP, PIPE, and other similar approaches on the problems presented in this paper as well as on other problems whose size can be modified without affecting the inherent problem difficulty. Efforts to introduce linkage learning into GP (for example, [18, 2]) should continue, with the goal of designing robust GP methods that provide scalable solutions to broad classes of GP problems. Finally, more theory should be developed to match the achievements in this area in the domain of GAs [13, 20, 5, 15, 11, 10].
This paper focused on the scalability of two GP algorithms: standard GP and PIPE.
Two basic test functions were used: ORDER and TRAP. Both functions were defined using one binary function JOIN and complementary terminal pairs X_i and X̄_i for i = 1, ..., n. ORDER can be solved without considering interactions between different program components, whereas TRAP introduces strong interactions, which make this function difficult for both standard crossover and mutation of GP, as well as the probabilistic recombination of PIPE.
The scalability of GP and PIPE was tested on basic ORDER and TRAP. Additionally, ORDER was extended by adding either of the following two primitives: (1) a binary function NEG_JOIN and (2) JUNK (or irrelevant) terminals. Thus, there were 4 problem types examined.
On all four problem types, the scalability of GP and PIPE was first tested by applying these algorithms to problem instances of different size (number of positive terminals). Then, the sensitivity of GP and PIPE to the proportion of irrelevant terminals to the relevant ones was examined.
The results presented in this paper indicate that the behavior of different variants of GP can be expected to be similar to that of standard binary-string GAs. There are two important consequences of this fact. First, to solve some classes of problems scalably, linkage learning may have to be incorporated into GP in order to identify and exploit interactions between different program components. Second, the lessons learned in the design and application of binary-string GAs should carry over to GP, as argued for example in [6, 19]; the first steps in this direction are represented by the decision-making model of population sizing in GP, which was based on the decision-making population-sizing model for standard GAs [7, 5].
The results also indicate that if the recombination operator captures interactions in the problem properly, increasing the mixing effects of recombination leads to better performance. That is why PIPE outperformed standard GP on problems where program components could be treated independently. This fact together with the need for linkage learning should encourage the application of probabilistic recombination operators of estimation of distribution algorithms (EDAs) [12, 9, 14] to the domain of GP. Some representatives of EDAs applied to the GP domain are [16, 17, 18, 2].
Finally, the results show that both GP and PIPE can deal with irrelevant terminals and unnecessary functions relatively well and their performance gets only slightly worse when adding these primitives.
Acknowledgments
This work was partially supported by the Research Award and the Research Board at the University of Missouri. Some experiments were done using the hBOA software developed by Martin Pelikan and David E. Goldberg at the University of Illinois at Urbana-Champaign. This work was also sponsored by the Air Force Office of Scientific Research, Air Force Materiel Command, USAF, under grant F49620-03-1-0129, the National Science Foundation under ITR grant DMR-99-76550 (at Materials Computation Center) and ITR grant DMR-0121695 (at CPSD), and the Dept. of Energy under grant DEFG02-91ER45439 (at Frederick Seitz MRL). The U.S. Government is authorized to reproduce and distribute reprints for government purposes notwithstanding any copyright notation thereon.
References
-  S. Baluja. Population-based incremental learning: A method for integrating genetic search based function optimization and competitive learning. Tech. Rep. No. CMU-CS-94-163, Carnegie Mellon University, Pittsburgh, PA, 1994.
-  P. A. N. Bosman and E. D. de Jong. Learning probabilistic tree grammars for genetic programming. pages 190–199, 2004.
-  K. Deb and D. E. Goldberg. Analyzing deception in trap functions. Foundations of Genetic Algorithms, 2:93–108, 1993.
-  D. E. Goldberg. Genetic algorithms in search, optimization, and machine learning. Addison-Wesley, Reading, MA, 1989.
-  D. E. Goldberg. The design of innovation: Lessons from and for competent genetic algorithms, volume 7 of Genetic Algorithms and Evolutionary Computation. Kluwer Academic Publishers, 2002.
-  D. E. Goldberg and U.-M. O’Reilly. Where does the good stuff go, and why? How contextual semantics influence program structure in simple genetic programming. Proceedings of the First European Workshop on Genetic Programming, 1391:16–36, 14-15 Apr. 1998.
-  G. R. Harik, E. Cantú-Paz, D. E. Goldberg, and B. L. Miller. The gambler’s ruin problem, genetic algorithms, and the sizing of populations. Proceedings of the International Conference on Evolutionary Computation (ICEC-97), pages 7–12, 1997. Also IlliGAL Report No. 96004.
-  J. R. Koza. Genetic programming: On the programming of computers by means of natural selection. The MIT Press, Cambridge, MA, 1992.
-  P. Larrañaga and J. A. Lozano, editors. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer, Boston, MA, 2002.
-  F. G. Lobo, D. E. Goldberg, and M. Pelikan. Time complexity of genetic algorithms on exponentially scaled problems. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2000), pages 151–158, 2000. Also IlliGAL Report No. 2000016.
-  B. L. Miller and D. E. Goldberg. Optimal sampling for genetic algorithms. Intelligent Engineering Systems through Artificial Neural Networks, 6:291–297, 1996.
-  H. Mühlenbein and G. Paaß. From recombination of genes to the estimation of distributions I. Binary parameters. Parallel Problem Solving from Nature, pages 178–187, 1996.
-  H. Mühlenbein and D. Schlierkamp-Voosen. Predictive models for the breeder genetic algorithm: I. Continuous parameter optimization. Evolutionary Computation, 1(1):25–49, 1993.
-  M. Pelikan, D. E. Goldberg, and F. Lobo. A survey of optimization by building and using probabilistic models. Computational Optimization and Applications, 21(1):5–20, 2002. Also IlliGAL Report No. 99018.
-  M. Pelikan, K. Sastry, and D. E. Goldberg. Scalability of the Bayesian optimization algorithm. International Journal of Approximate Reasoning, 31(3):221–258, 2002. Also IlliGAL Report No. 2001029.
-  R. Salustowicz and J. Schmidhuber. Probabilistic incremental program evolution. Evolutionary Computation, 5(2):123–141, 1997.
-  R. Salustowicz and J. Schmidhuber. H-PIPE: Facilitating hierarchical program evolution through skip nodes. Technical Report IDSIA-08-98, Istituto Dalle Molle di Studi sull’Intelligenza Artificiale (IDSIA), Lugano, Switzerland, 1998.
-  K. Sastry and D. E. Goldberg. Probabilistic model building and competent genetic programming. April 2003.
-  K. Sastry, U.-M. O’Reilly, and D. E. Goldberg. Convergence-time models for the simple genetic algorithm with finite population. IlliGAL Report No. 2001028, University of Illinois at Urbana-Champaign, Illinois Genetic Algorithms Laboratory, Urbana, IL, 2001.
-  D. Thierens. Analysis and design of genetic algorithms. PhD thesis, Katholieke Universiteit Leuven, Leuven, Belgium, 1995.