Ensemble representation learning: an analysis of fitness and survival for wrapper-based genetic programming methods

03/20/2017 ∙ by William La Cava, et al. ∙ University of Pennsylvania

Recently we proposed a general, ensemble-based feature engineering wrapper (FEW) that was paired with a number of machine learning methods to solve regression problems. Here, we adapt FEW for supervised classification and perform a thorough analysis of fitness and survival methods within this framework. Our tests demonstrate that two fitness metrics, one introduced as an adaptation of the silhouette score, outperform the more commonly used Fisher criterion. We analyze survival methods and demonstrate that ϵ-lexicase survival works best across our test problems, followed by random survival which outperforms both tournament and deterministic crowding. We conduct a benchmark comparison to several classification methods using a large set of problems and show that FEW can improve the best classifier performance in several cases. We show that FEW generates consistent, meaningful features for a biomedical problem with different ML pairings.







1. Introduction

When traditional genetic programming (GP) is applied to classification or regression, individual programs assume the roles of feature selection, transformation, and model prediction, and are evaluated on the accuracy of their estimates or predictions. The flexibility of evolving both the structure and the parameters of a model comes with a heavy computational cost, which can be mitigated by instead using a fast (e.g. polynomial-time) machine learning (ML) method to optimize the parameters of a GP model with respect to an objective function (for example, least-squares error minimization with linear regression). With this in mind, many variants of GP have been proposed that embed linear regression or local search in each program, leading to better models (Iba and Sato, 1994; Kommenda et al., 2013; Arnaldo et al., 2014; La Cava et al., 2015). The high-level takeaway from the success of these hybrid methods is that the computational effort of GP is best focused on the parts of the modeling process that are known to be NP-hard, namely feature selection (Foster et al., 2015) and feature construction (Krawiec, 2002).

The task of feature construction, also known as feature engineering or representation learning, is well motivated: the central factor affecting the quality of a model derived from ML is the ability of the data representation to facilitate learning (Bengio et al., 2013). This paper focuses on supervised classification, for which the goal is to find a mapping ŷ(x) that associates the vector of attributes x ∈ ℝᵈ with class labels from the set 𝒦 = {k₁, …, k_K} using paired examples {(xᵢ, yᵢ), i = 1, …, N}. The goal of feature engineering is to find a new representation of x via a d′-dimensional feature mapping Φ(x): ℝᵈ → ℝ^d′, such that a classifier ŷ(Φ(x)) classifies samples more accurately than ŷ(x).
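To make the notation concrete, the following sketch (with made-up data; the XOR-style interaction mirrors the GAMETES problems studied later) shows how a classifier ŷ(Φ(x)) can succeed where ŷ(x) fails when Φ captures the right interaction:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 2)).astype(float)   # attributes x
y = np.logical_xor(X[:, 0], X[:, 1]).astype(int)      # labels require an interaction

# y_hat(x): a linear classifier on the raw attributes is near chance on XOR
raw_acc = LogisticRegression().fit(X, y).score(X, y)

# y_hat(Phi(x)): a one-dimensional engineered representation makes it separable
Phi = np.logical_xor(X[:, 0], X[:, 1]).reshape(-1, 1).astype(float)
eng_acc = LogisticRegression().fit(Phi, y).score(Phi, y)
print(raw_acc, eng_acc)
```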

GP-based approaches to representation learning include evolving single features for decision trees (DT) (Muharram and Smith, 2005) and coupling an ML model with each program (Krawiec, 2002; Silva et al., 2015; Žegklitz and Pošík, 2017). Recent work (De Melo, 2014; Arnaldo et al., 2015) has advocated what we refer to as an “ensemble” approach, which treats the entire GP population as Φ, with each program representing a transformation of the form φ(x): ℝᵈ → ℝ. These proposed methods feed the population output into a linear regression model to make predictions.

The ML-specific nature of these previous approaches motivates our development of the more general feature engineering wrapper (FEW) method (available from https://lacava.github.io/few and via the Python Package Index: https://pypi.python.org/pypi/FEW), which is a wrapper-based ensemble method of feature engineering with GP (La Cava and Moore, 2017). Unlike previous approaches, FEW allows any learning algorithm in scikit-learn format (Pedregosa et al., 2011) to be used for estimation. FEW has been demonstrated for regression with several ML pairings, including Lasso (Tibshirani, 1996), linear and nonlinear support vector regression, DT, and k-nearest neighbors (KNN). Central to its ability to evolve features in a single population is the introduction of ϵ-lexicase survival, which produces uncorrelated population behavior.

The wrapper-based ensemble approach to GP is under-studied and presents new challenges from an evolutionary computation standpoint, namely the need for individuals in the population to complement each other in facilitating the learning of the ML method with which they are paired. Our goal in this paper is to use FEW as a test bed for evaluating several survival and fitness techniques in this new framework for supervised classification. In addition, whereas FEW was previously demonstrated in side-by-side comparisons with default ML methods, here we analyze more robustly whether FEW can, in general, produce better models than existing ML techniques when hyper-parameter optimization of every method is considered.

This paper contains four main contributions. First, it presents a much-needed analysis of fitness and survival methods for ensemble-based representation learning with GP, which is currently lacking in the field. Second, it focuses on the classification task, which has not been the focus of previous methods with this GP framework. Third, it presents robust comparisons of FEW to other ML methods, including a previously proposed GP method that also focuses on feature learning. As a final contribution we analyze a biomedical problem for which FEW is able to correctly identify the nonlinear, underlying structure of the data across ML pairings, thereby showing the usefulness of learning readable data representations.

We pair FEW with several well-known classifiers in our analysis: logistic regression (LR), support vector classification (SVC), KNN, DT, and random forests (RF). We present an overview of FEW in Section 2, including a description of the fitness and survival methods that are tested. We review related work more thoroughly in Section 3, distinguishing between wrapper and filter approaches as well as single, multiple, and ensemble representations of features in GP. The results of the experiments on FEW and its comparison to other methods are shown in Section 5, with discussion and conclusions following in Section 6.

2. Methods

The components of FEW are summarized in Figure 1. The learning process begins by fitting the ML method to the original data. FEW maintains an internal validation set to evaluate new models, which guarantees that the returned model will have a cross-validation (CV) fitness at least as good as the initial data representation can produce. FEW then initializes a population of feature transformations, Φ, seeded with the features from the initial ML model that have non-zero coefficients. Each generation, a new ML model is trained on Φ(x) to produce ŷ(Φ(x)).

Figure 1. A diagram showing the main steps in FEW. P denotes the starting population; P′ is the population after selection; O is the offspring produced by crossover and mutation; and P″ is the new population after conducting survival on P′ ∪ O.

The selection step of FEW is the entry point for new information from the ML method about the quality of the current representation. Methods that admit regularization (available in the scikit-learn implementations of LR and SVC) or feature importance scores (DT and RF) apply selective pressure to the GP population by eliminating any individual whose corresponding coefficient or feature importance in the ML model is zero. Feature importance for DT and RF is measured using the Gini importance (Breiman and Cutler, 2003). Thus ML and GP share the feature selection role. After selection, the remaining individuals (P′ in Figure 1) are used to produce offspring, O, via sub-tree crossover and point mutation. In this way FEW differs from previous ensemble representation learning approaches (Arnaldo et al., 2015; McConaghy, 2011) in that it incorporates crossover for variation instead of strict mutation.
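As a concrete sketch of this shared selection step (the helper name and toy data are ours, not FEW's API), zero-weight features in the fitted model are dropped from the population:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def surviving_indices(model):
    """Keep engineered features the ML model assigns nonzero weight.
    Hypothetical helper: uses coef_ for linear models and
    feature_importances_ for tree ensembles, as in FEW's selection step."""
    if hasattr(model, "coef_"):
        w = np.max(np.abs(model.coef_), axis=0)  # max |coefficient| over classes
    else:
        w = model.feature_importances_           # Gini importances for DT/RF
    return np.flatnonzero(w > 0)

rng = np.random.default_rng(1)
Phi = rng.normal(size=(200, 5))                  # outputs of 5 candidate features
y = (Phi[:, 0] + Phi[:, 2] > 0).astype(int)      # only features 0 and 2 matter
lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(Phi, y)
print(surviving_indices(lr))                     # l1 tends to zero out the noise features
```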

The fitness step (see Section 2.1) evaluates the ability of P′ and O to adequately distinguish between classes in y. The survival step in FEW (see Section 2.2) reduces the pool of parents and offspring back to the original population size, and the surviving set of transformations, Φ, is used at the beginning of the next generation to fit a new ML model.

2.1. Fitness

We compare three fitness metrics (Eqns. 1–3 below) in the experimental analysis of Section 4.1. In contrast to traditional GP, the fitness of an engineered feature must measure the individual's ability to separate data between classes rather than its predictive capacity, since φ is not itself a model. A simple approach to assessing feature quality is to look at the coefficient of determination,

    f₁(φ) = R²(φ(x), y),    (1)

i.e. the squared correlation between the feature's output and the class labels. For binary classification, R² seems appropriate, since it only has to capture the correlation of the feature with a change from 0 to 1. However, for multiclass classification, R² imposes an additional constraint on the feature by rewarding it for increasing in the direction of the class label values. For certain problems (e.g. one in which the ordering of the class labels corresponds to a degree of risk), this imposed fitness pressure may be warranted, but in the general case we do not want to assume that the order of the class labels, or the relative distance between them in a feature, is meaningful. Instead, we want to reward features that separate samples from different classes and cluster samples within classes.
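A minimal sketch of Eqn. (1), assuming R² is computed as the squared Pearson correlation between a feature's output and the labels:

```python
import numpy as np

def r2_fitness(phi_x, y):
    """Squared Pearson correlation between feature output phi(x) and labels y."""
    r = np.corrcoef(phi_x, y)[0, 1]
    return r * r

y = np.array([0, 0, 0, 1, 1, 1])
good = np.array([0.1, 0.2, 0.0, 0.9, 1.1, 1.0])   # tracks the 0 -> 1 change
bad = np.array([0.5, 1.0, 0.1, 0.6, 0.9, 0.2])    # uninformative
print(r2_fitness(good, y), r2_fitness(bad, y))
```

Note the label-ordering issue discussed above: a feature that separates multiple classes without trending monotonically in the label values scores poorly under this metric even if it clusters the classes well.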

Other GP feature construction methods have used the Fisher criterion (Guo and Nandi, 2006; Ahmed et al., 2014) to achieve such a measure. The Fisher criterion assigns the fitness of a feature as

    f₂(φ) = (2 / (K(K − 1))) Σ_{i=1}^{K−1} Σ_{j=i+1}^{K} (μᵢ − μⱼ)² / (σᵢ² + σⱼ²),    (2)

where μᵢ is the mean of φ(x) over the samples belonging to class kᵢ, and σᵢ is the corresponding standard deviation. The Fisher criterion gives a measure of the average pairwise separation between, and dispersion within, classes for φ(x). However, it does not provide fine-grained information about the distance of specific samples in the transformation. In an attempt to extract this information, we include the silhouette score (Rousseeuw, 1987) in our comparisons. Like Eqn. (2), the silhouette score assesses feature quality by combining the within-class variance with the distance between neighboring classes. Thus it captures both the tightness of a cluster and its overlap with the nearest cluster. The silhouette score for a single sample i is defined as

    s(i) = (b(i) − a(i)) / max(a(i), b(i)).    (3)

Here, a(i) is the mean distance from sample i to the other samples sharing its class label, and b(i) is the mean distance from sample i to the samples in the next nearest class (according to centroid distance). Thus Eqn. (3) takes into account both the pairwise distances within a class and the separation of neighboring classes from each other. Here the Euclidean distance metric is used. For the aggregate fitness of an engineered feature, the average silhouette score over all samples, s̄, is used.
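The two class-separation metrics can be sketched as follows; the pairwise-average form of the Fisher criterion is our reading of Eqn. (2), and the silhouette fitness leans on scikit-learn (which picks the neighboring cluster by mean dissimilarity rather than centroid distance):

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import silhouette_score

def fisher_fitness(phi_x, y):
    """Average pairwise Fisher criterion over class pairs (cf. Eqn. 2)."""
    stats = {k: (phi_x[y == k].mean(), phi_x[y == k].var()) for k in np.unique(y)}
    return np.mean([(stats[i][0] - stats[j][0]) ** 2 / (stats[i][1] + stats[j][1] + 1e-12)
                    for i, j in combinations(stats, 2)])

def silhouette_fitness(phi_x, y):
    """Mean silhouette score of all samples in the 1-D feature space (Eqn. 3)."""
    return silhouette_score(phi_x.reshape(-1, 1), y)

y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
tight = np.array([0., .1, .2, 5., 5.1, 5.2, 10., 10.1, 10.2])  # clustered by class
mixed = np.array([0., 5., 10., .1, 5.1, 10.1, .2, 5.2, 10.2])  # classes interleaved
print(fisher_fitness(tight, y), silhouette_fitness(tight, y))
print(fisher_fitness(mixed, y), silhouette_fitness(mixed, y))
```

Unlike R², neither metric rewards a feature for ordering the classes; permuting the class labels leaves both scores unchanged.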

2.2. Survival

Unlike typical populations in model-based GP, the surviving individuals in FEW are assessed together in an ML estimation, and therefore benefit from being chosen to work well together. In fact, many ML pairings, including LR and SVC, depend on low collinearity between features. We test four methods for achieving this cooperation: tournament survival (tournaments of size 2), deterministic crowding, ϵ-lexicase survival, and random survival. Tournament survival is agnostic to the population structure when selecting survivors, and simply picks the individual in the tournament with the best fitness to survive. Meanwhile, deterministic crowding and ϵ-lexicase survival are designed to promote feature diversity, which should influence the ability of the population to produce an effective representation for the ML training step. We include random survival in the tests to control for the effect of unguided search.

Deterministic crowding (Mahfoud, 1995) is a niching mechanism in which offspring compete only with the parent they are most similar to. We define similarity as the correlation (R², Eqn. 1) between a child and its parents. In the case of mutation, there is only one parent, so no similarity comparison is necessary. Although deterministic crowding is traditionally a steady-state algorithm, its implementation here is generational. Children take the place of their parent in the surviving population if and only if they have a better fitness. This algorithm produces niches in the population that should maintain diverse features.
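A generational sketch of this scheme (simplified so that each child competes with the most similar member of the parent pool; fitness and similarity are supplied as functions):

```python
def crowding_survival(parents, children, fitness, similarity):
    """Deterministic crowding, generational form: each child replaces the
    parent it is most similar to, but only if the child is strictly fitter."""
    survivors = list(parents)
    for child in children:
        j = max(range(len(survivors)), key=lambda i: similarity(child, survivors[i]))
        if fitness(child) > fitness(survivors[j]):
            survivors[j] = child
    return survivors

# Toy individuals are numbers; fitness is the value itself,
# similarity is negative distance.
out = crowding_survival([1.0, 10.0], [2.0, 9.0],
                        fitness=lambda a: a,
                        similarity=lambda a, b: -abs(a - b))
print(out)
```

Trace: child 2.0 is most similar to parent 1.0 and is fitter, so it replaces it; child 9.0 is most similar to 10.0 but not fitter, so 10.0 survives.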

ϵ-lexicase survival is a new survival technique adapted from ϵ-lexicase selection (La Cava et al., 2016) for use in FEW. ϵ-lexicase selection is, in turn, an adaptation of lexicase selection (Spector, 2012; Helmuth et al., 2014) for continuous-valued problems. Lexicase selection works by pressuring individuals in the population to solve unique subsets of the training samples (i.e. cases), shifting selective pressure to the cases that are most difficult in terms of population performance. ϵ-lexicase survival differs from ϵ-lexicase selection in that it removes the individuals selected at each step from the remaining selection pool and adds them to the survivors for the next generation. Each iteration of ϵ-lexicase survival proceeds as follows:

GetSurvivors():
    survivors ← ∅
    for each parent selection:
        S ← population \ survivors                              initial pool
        C ← training cases                                      training cases
        ϵ_c ← MAD of case-c fitnesses over the population, ∀c ∈ C   get ϵ for each case
        while |C| > 0 and |S| > 1:                              main loop
            c ← random choice from C                            pick a case
            e* ← best fitness in S on case c                    determine elite
            S ← {s ∈ S : fitness of s on c within ϵ_c of e*}    reduce pool
            C ← C \ {c}                                         reduce cases
        survivors ← survivors ∪ {random choice from S}          pick survivor
    return survivors                                            return survivors

In the routine above, ϵ_c is the median absolute deviation of the fitnesses on case c across the population.
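A runnable sketch of the routine (assuming per-case fitnesses are errors to be minimized, stored as an individuals × cases array; the function and variable names are ours):

```python
import numpy as np

def epsilon_lexicase_survival(case_errors, n_survivors, rng=None):
    """Epsilon-lexicase survival: selected individuals leave the pool and
    join the survivors. case_errors[i, c] is individual i's error on case c;
    epsilon_c is the median absolute deviation across the population."""
    rng = rng if rng is not None else np.random.default_rng()
    med = np.median(case_errors, axis=0)
    eps = np.median(np.abs(case_errors - med), axis=0)    # MAD per case
    pool = list(range(case_errors.shape[0]))
    survivors = []
    for _ in range(n_survivors):
        cand = list(pool)                                 # initial pool
        cases = list(rng.permutation(case_errors.shape[1]))
        while cases and len(cand) > 1:                    # main loop
            c = cases.pop()                               # pick a case
            elite = min(case_errors[i, c] for i in cand)  # determine elite
            cand = [i for i in cand if case_errors[i, c] <= elite + eps[c]]  # reduce pool
        s = cand[int(rng.integers(len(cand)))]            # pick survivor
        survivors.append(s)
        pool.remove(s)                                    # removed from future pools
    return survivors
```

For example, an individual that is strictly elite on every case is guaranteed to be the first survivor, after which selection continues among the rest of the population.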

3. Related Work

Feature construction has received considerable attention in GP, with implementations falling into single-feature, multiple-feature, and ensemble categories. Single-feature representations attempt to evolve a single solution that is an engineered feature, as in (Muharram and Smith, 2005; Guo and Nandi, 2006). Multiple-feature representations encode a candidate set of feature transformations in each individual (Krawiec, 2002; Smith and Bull, 2005; Silva et al., 2015; La Cava, William et al., 2017), such that each individual is a multi-output estimate of Φ. In this case, a separate ML model is trained on the outputs of each program, and the resulting output is used to assign fitness to each individual. Ensembles are a more recent approach (McConaghy, 2011; De Melo, 2014; Arnaldo et al., 2015; La Cava and Moore, 2017) designed to reduce the computational complexity of fitting a model to each individual. Ensemble approaches instead fit a single ML model to the output of the entire population, treating each individual in the population as a single feature φ and the ensemble output of the population as Φ. Among these ensemble methods, FEW shares the most in common with evolutionary feature synthesis (EFS) (Arnaldo et al., 2015) in that it uses the more successful wrapper-based approach (Krawiec, 2002; Smith and Bull, 2005) and incorporates feature selection information from the ML routine. Unlike FEW, EFS pairs exclusively with Lasso (Tibshirani, 1996), uses three population partitions, and does not incorporate crossover between individuals. FEW is motivated by the hypotheses that 1) the ML pairing is best treated as a hyper-parameter of the method, and 2) existing diversity-preserving selection methods can be successfully adapted for ensemble-based feature survival.
As a final note, previous work does not often consider the effect of tuning the proposed algorithm or the ML approaches to which it is compared, a vital step both in algorithm comparisons (Caruana and Niculescu-Mizil, 2006) and in the application of ML to real-world problems.

4. Experimental Setup

We conduct two separate sets of experiments. The first set, described in Section 4.1, is designed to compare the fitness and survival methods for FEW in combination with different ML methods and hyper-parameters. We use the results of the first experiment to choose the fitness and survival method for FEW in the second set of experiments. The second set of experiments, described in Section 4.2, is a benchmark comparison of FEW to several ML methods on a larger set of classification problems. All the datasets used in the comparison are freely available via the Penn Machine Learning Benchmark repository (https://github.com/EpistasisLab/penn-ml-benchmarks).

4.1. FEW comparisons

We tune the choice of fitness and survival methods by performing an experimental analysis of FEW on the tuning problems in Table 3 using the parameters listed in Table 1.

Setting Values
Population size 10, 50, 100
Max depth 2, 3
Fitness R², silhouette
Survival tournament, deterministic crowding, ϵ-lexicase, random
Table 1. Tested settings for survival and fitness study.

4.2. Comparison to other methods

We evaluate FEW's performance in comparison to six other ML approaches: Gaussian naïve Bayes (NB), LR, KNN, SVC, RF, and M4GP (La Cava, William et al., 2017), a multi-feature GP method derived from (Silva et al., 2015) that couples a multi-feature representation with a nearest centroid classifier (Tibshirani et al., 2002). For more information on the implementations of NB, LR, KNN, SVC, and RF, refer to (Pedregosa et al., 2011). These methods are evaluated on 20 classification problems that vary in the number of classes, samples, and features, as shown in Table 3. To ensure robust comparisons, we include hyper-parameter optimization in the training phase of each method: we perform a grid search over the hyper-parameters of each method (shown in Table 2), using 5-fold cross-validation on the training set to choose the final parameters. The model with the best average cross-validation accuracy on the training set is evaluated on the test set. This process is repeated for 30 shuffled, 50/50 train/test splits of the data. To control for the different numbers of possible hyper-parameter combinations between methods, we limit each grid search to a maximum of 100 combinations of hyper-parameter settings during training.
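The protocol can be sketched with scikit-learn (the dataset and grid values are illustrative; `load_digits` is a small stand-in for the optdigits problem):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
# one shuffled 50/50 train/test split (the paper repeats this 30 times)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.5, random_state=0)

# grid search with 5-fold CV on the training set only
grid = {"n_neighbors": [1, 5, 15, 25], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), grid, cv=5).fit(X_tr, y_tr)

# the model with the best mean CV accuracy is scored on the held-out test set
test_acc = search.score(X_te, y_te)
print(search.best_params_, round(test_acc, 3))
```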

The hyper-parameters considered for FEW (see Table 2) include the population size, expressed as a function of the number of features in the data; the ML method; the output type of the features (float or bool); and the max feature depth. Floating-point outputs use an operator set of arithmetic and transcendental functions (+, −, ×, ÷, sin, cos, exp, log, square, cube, √), and boolean outputs add the logical and comparison operators AND, OR, XOR, NOT, =, <, ≤, >, ≥. It is important to note that tuning of the ML method is not considered when it is paired with FEW. As a result, this experiment compares the relative effect of learning a representation for a default ML method against tuning the hyper-parameters of those methods.

Method Hyper-parameters
FEW Population size (0.25d, …, 3d, as a multiple of the number of features d); ML (LR, KNN, RF, SVM); output type (bool, float); max depth (2, 3)
M4GP Population size (250, 500, 1000); generations (50, 100, 500, 1000); selection method (tournament, lexicase); max length (10, 25, 50, 100)
Gaussian Naïve Bayes none
Logistic Regression Regularization coefficient (0.001, …, 100); penalty (ℓ1, ℓ2, elastic net); epochs (5, 10)
Support Vector Classifier Regularization coefficient (0.01, …, 100, ‘auto’); γ (0.01, 10, 1000, ‘auto’); kernel (linear, sigmoid, radial basis function)
Random Forest Classifier No. estimators (10, 100, 1000); minimum weight fraction for leaf (0.0, 0.25, 0.5); max features (√d, log₂ d, None); splitting criterion (entropy, gini)
K-Nearest Neighbor Classifier K (1, …, 50); weights (uniform, distance)
Table 2. Experimental setup for the method comparisons. The hyper-parameters that were searched are shown on the right.
Dataset Classes Samples Features
Tuning Problems
auto 5 202 25
calendarDOW 5 399 32
corral 2 160 6
new thyroid 3 215 5
Benchmark Problems
analcatdata authorship 4 841 70
analcatdata cyyoung8092 2 97 10
coil2000 2 9822 85
GMT 2w-1000a-0.4h 2 1600 1000
GMT 2w-20a-0.4h 2 1600 20
german 2 1000 20
Hill Valley with noise 2 1212 100
Hill Valley without noise 2 1212 100
magic 2 19020 10
mfeat fourier 10 2000 76
mfeat pixel 10 2000 240
molecular biology promoters 2 106 58
monk2 2 601 6
optdigits 10 5620 64
parity5+5 2 1124 10
schizo 2 340 14
texture 11 5500 40
vowel 11 990 13
xd6 2 973 9
yeast 9 1479 8
Table 3. Classification data sets used in this paper for tuning (top) and comparison to other methods (bottom). GMT stands for GAMETES data sets, which are named according to number of epistatic loci (w), number of attributes (a), noise fraction (n), and heterogeneity fraction (h).

5. Results

The fitness and survival methods are compared on the tuning datasets in Figures 2 and 3, respectively. The fitness metric comparisons yield unexpected results. The Fisher criterion is outperformed by both R² and the silhouette score in 3 out of 4 problems (p < 4.8e-7). Surprisingly, we find that the silhouette score does not outperform R² as a fitness metric either; across problems and ML pairings, there is no significant difference in performance aside from new-thyroid. This is surprising given our hypothesis in Section 2.1 that the class label assumptions implicit in R² would make it less suited to classification with multiple labels. On this evidence, in conjunction with the lower computational complexity of R², we opt to use R² as the fitness criterion for the benchmark comparison.

We find that ϵ-lexicase survival produces more accurate classifiers than deterministic crowding, tournament, and random survival across problems and ML pairings. It is significantly correlated with higher test accuracy according to a t-test (p < 2e-16) and significantly outperforms tournament (p < 0.002) and deterministic crowding (p < 2.4e-7) according to pairwise Wilcoxon tests, correcting for multiple comparisons. ϵ-lexicase survival also outperforms random survival on auto (p < 4.4e-8) and new-thyroid (p < 2e-16), and ties it on the other two problems (for calendarDOW, p = 0.094). Random survival performs strongly compared to tournament and deterministic crowding, outperforming those methods on 3 out of 4 problems. These results motivate our use of ϵ-lexicase survival in the benchmark comparison.

The test set accuracies of the 7 methods on the benchmark datasets are shown in boxplot form in Figure 4, and the mean rankings are summarized in Figure 5. Performance varies across problems, with RF, SVC, M4GP, or FEW generally producing the highest test accuracy. Whereas FEW generally does well on the problems for which M4GP excels, FEW also does well in cases where M4GP underperforms, which is likely due to FEW's ability to tune the ML method with which it is paired. Three problems stand out as particularly amenable to feature engineering: GMT 2w-20a-0.4h, Hill_Valley_without_noise, and parity5+5. These three problems are well known for containing strong interactions between features, which helps explain the observed increase in performance from FEW. In terms of mean rankings across problems, FEW generates the best classifiers among the methods tested, followed closely by SVC and RF. A Friedman test of the rankings with post-hoc analysis reveals that RF, SVC, and FEW significantly outperform NB and LR across all problems (p < 0.039).

As expected, the computation time of FEW is higher than that of the other ML methods (see Figure 6) due to its wrapper-based approach. The quicker performance of M4GP may be explained by its C++ implementation, compared to FEW's Python implementation, as well as by M4GP's use of a consistently fast ML pairing.

We show models generated with single runs of FEW on GMT 2w-20a-0.4h in Table 4 using DT and LR. This genetics problem is generated using the GAMETES simulation tool (Urbanowicz et al., 2012). It consists of 20 attributes, 18 of which are noise, and two of which interact epistatically, meaning they must be considered together to infer the correct class (the labels contain noise as well). The models correctly identify the interaction between features 18 and 19. For this problem FEW’s transformation provides the essential knowledge required to solve this problem, whereas the ML approaches simply serve as a discriminant function for processing the information presented via the transformation.

Figure 2. A comparison of fitness definitions on the tuning data sets. Each subplot presents a different data set; the x-axis corresponds to the paired learner (DT, LR, KNN) and the boxplots represent the accuracy scores obtained with the silhouette score (Eqn. 3) or R² (Eqn. 1).
Figure 3. A comparison of survival algorithms on the tuning data sets. Each subplot presents a different data set; the x-axis corresponds to the paired learner (DT, LR, KNN) and the boxplots represent the accuracy scores using different survival methods.
Figure 4. Comparison of test set accuracy for various methods on the benchmark problems.
Figure 5.

Ranking of methods over all of the benchmark problems. Bars indicate the standard error.

Figure 6. Training time of methods over all of the benchmark problems.
Decision Tree Model
Importance  Feature
0.899       XOR(x18, x19)

Logistic Regression Model
Coefficient  Feature
1.992        XOR(x18, x19)
0.996        XOR(x18, x19)

Performance              Decision Tree  Logistic Regression
Initial ML CV accuracy   0.487          0.473
Final model CV accuracy  0.763          0.803
Test accuracy            0.787          0.755
Runtime (s)              8.2            8.2
Table 4. Example solutions to the GMT 2w-20a-0.4h problem using decision tree and logistic regression pairings. FEW identifies the correct underlying epistatic interaction between features x18 and x19 in both cases. The final model is either a simple decision tree split in order of the estimated importances or a logistic regression model with four terms.

6. Discussion & Conclusion

Our results suggest that FEW is a useful technique for supervised classification problems. FEW performs the best on average among the algorithms tested, which include optimized SVC, RF, KNN, M4GP, LR, and NB models. This result provides evidence that, for these ML methods, the data representation can influence algorithm performance as much as, if not more than, the parameter settings of those algorithms. Although it has not been tested here, it is likely that including hyper-parameter optimization of the ML methods paired with FEW in the tuning step would show even greater gains in performance over the baseline approach. FEW also performs better than a multiple-feature GP approach (M4GP) that uses a fixed ML pairing.

Despite FEW's runtime in these tests, a complexity analysis suggests it is well positioned for large datasets in comparison to other feature construction techniques. Whereas techniques like polynomial feature expansion scale poorly with the number of features d (O(dⁿ) for an n-degree polynomial) and techniques like kernel transformations scale poorly with the number of samples N (the kernel matrix alone requires O(N²) entries) (Friedman et al., 2001), FEW scales independently of the number of features in the dataset, linearly with N, and quadratically with the population size. These observations warrant further investigation on large datasets.

7. Acknowledgements

This work was supported by the Warren Center for Network and Data Science at the University of Pennsylvania, as well as NIH grants P30-ES013508, AI116794 and LM009012.


  • Ahmed et al. (2014) Soha Ahmed, Mengjie Zhang, Lifeng Peng, and Bing Xue. 2014. Multiple feature construction for effective biomarker identification and classification using genetic programming. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation. ACM, 249–256. http://dl.acm.org/citation.cfm?id=2598292
  • Arnaldo et al. (2014) Ignacio Arnaldo, Krzysztof Krawiec, and Una-May O’Reilly. 2014. Multiple regression genetic programming. In Proceedings of the 2014 conference on Genetic and evolutionary computation. ACM Press, 879–886. DOI:http://dx.doi.org/10.1145/2576768.2598291 
  • Arnaldo et al. (2015) Ignacio Arnaldo, Una-May O’Reilly, and Kalyan Veeramachaneni. 2015. Building Predictive Models via Feature Synthesis. ACM Press, 983–990. DOI:http://dx.doi.org/10.1145/2739480.2754693 
  • Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35, 8 (2013), 1798–1828. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6472238
  • Breiman and Cutler (2003) Leo Breiman and Adele Cutler. 2003. Random Forests. (2003). http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
  • Caruana and Niculescu-Mizil (2006) Rich Caruana and Alexandru Niculescu-Mizil. 2006. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd international conference on Machine learning. ACM, 161–168.
  • De Melo (2014) Vinícius Veloso De Melo. 2014. Kaizen programming. In GECCO ’14: Proceedings of the Genetic and Evolutionary Computation Conference. ACM Press, 895–902. DOI:http://dx.doi.org/10.1145/2576768.2598264
  • Foster et al. (2015) Dean Foster, Howard Karloff, and Justin Thaler. 2015. Variable selection is hard. In Proceedings of The 28th Conference on Learning Theory. 696–709. http://www.jmlr.org/proceedings/papers/v40/Foster15.pdf
  • Friedman et al. (2001) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2001. The elements of statistical learning. Vol. 1. Springer series in statistics Springer, Berlin. http://statweb.stanford.edu/~tibs/book/preface.ps
  • Guo and Nandi (2006) Hong Guo and Asoke K. Nandi. 2006. Breast cancer diagnosis using genetic programming generated feature. Pattern Recognition 39, 5 (May 2006), 980–987. DOI:http://dx.doi.org/10.1016/j.patcog.2005.10.001 
  • Helmuth et al. (2014) T. Helmuth, L. Spector, and J. Matheson. 2014. Solving Uncompromising Problems with Lexicase Selection. IEEE Transactions on Evolutionary Computation PP, 99 (2014), 1–1. DOI:http://dx.doi.org/10.1109/TEVC.2014.2362729 
  • Iba and Sato (1994) Hitoshi Iba and Taisuke Sato. 1994. Genetic Programming with Local Hill-Climbing. Technical Report ETL-TR-94-4. Electrotechnical Laboratory, 1-1-4 Umezono, Tsukuba-city, Ibaraki, 305, Japan. http://www.cs.ucl.ac.uk/staff/W.Langdon/ftp/papers/Iba_1994_GPlHC.pdf
  • Kommenda et al. (2013) Michael Kommenda, Gabriel Kronberger, Stephan Winkler, Michael Affenzeller, and Stefan Wagner. 2013. Effects of constant optimization by nonlinear least squares minimization in symbolic regression. In GECCO ’13 Companion: Proceeding of the fifteenth annual conference companion on Genetic and evolutionary computation conference companion. ACM, Amsterdam, The Netherlands, 1121–1128. DOI:http://dx.doi.org/doi:10.1145/2464576.2482691 
  • Krawiec (2002) Krzysztof Krawiec. 2002. Genetic programming-based construction of features for machine learning and knowledge discovery tasks. Genetic Programming and Evolvable Machines 3, 4 (2002), 329–343. http://link.springer.com/article/10.1023/A:1020984725014
  • La Cava et al. (2015) William La Cava, Thomas Helmuth, Lee Spector, and Kourosh Danai. 2015. Genetic Programming with Epigenetic Local Search. In GECCO ’15: Proceedings of the Genetic and Evolutionary Computation Conference. ACM Press, 1055–1062. DOI:http://dx.doi.org/10.1145/2739480.2754763 
  • La Cava and Moore (2017) William La Cava and Jason Moore. 2017. A General Feature Engineering Wrapper for Machine Learning Using ϵ-Lexicase Survival. In European Conference on Genetic Programming. Springer, 80–95. https://link.springer.com/chapter/10.1007/978-3-319-55696-3_6 DOI: 10.1007/978-3-319-55696-3_6.
  • La Cava et al. (2016) William La Cava, Lee Spector, and Kourosh Danai. 2016. Epsilon-Lexicase Selection for Regression. In GECCO ’16: Proceedings of the Genetic and Evolutionary Computation Conference. ACM, New York, NY, USA, 741–748. DOI:http://dx.doi.org/10.1145/2908812.2908898 
  • La Cava, William et al. (2017) La Cava, William, Silva, Sara, Vanneschi, Leonardo, Spector, Lee, and Moore, Jason H. 2017. Genetic Programming Representations for Multi-dimensional Feature Learning in Biomedical Classification. In European Conference on the Applications of Evolutionary Computation. Springer, 158-173. https://link.springer.com/chapter/10.1007/978-3-319-55849-3_11 DOI: 10.1007/978-3-319-55849-3_11.
  • Mahfoud (1995) Samir W Mahfoud. 1995. Niching methods for genetic algorithms. Ph.D. Dissertation.
  • McConaghy (2011) Trent McConaghy. 2011. FFX: Fast, scalable, deterministic symbolic regression technology. In Genetic Programming Theory and Practice IX. Springer, 235–260. http://link.springer.com/chapter/10.1007/978-1-4614-1770-5_13
  • Muharram and Smith (2005) Mohammed Muharram and George D. Smith. 2005. Evolutionary constructive induction. IEEE Transactions on Knowledge and Data Engineering 17, 11 (2005), 1518–1528. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1512037
  • Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, and others. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, Oct (2011), 2825–2830. http://www.jmlr.org/papers/v12/pedregosa11a.html
  • Rousseeuw (1987) Peter J. Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20 (Nov. 1987), 53–65. DOI:http://dx.doi.org/10.1016/0377-0427(87)90125-7
  • Silva et al. (2015) Sara Silva, Luis Muñoz, Leonardo Trujillo, Vijay Ingalalli, Mauro Castelli, and Leonardo Vanneschi. 2015. Multiclass Classification Through Multidimensional Clustering. In Genetic Programming Theory and Practice XIII. Vol. 13. Springer, Ann Arbor, MI.
  • Smith and Bull (2005) Matthew G. Smith and Larry Bull. 2005. Genetic programming with a genetic algorithm for feature construction and selection. Genetic Programming and Evolvable Machines 6, 3 (2005), 265–281. http://link.springer.com/article/10.1007/s10710-005-2988-7
  • Spector (2012) Lee Spector. 2012. Assessment of problem modality by differential performance of lexicase selection in genetic programming: a preliminary report. In Proceedings of the fourteenth international conference on Genetic and evolutionary computation conference companion. 401–408. http://dl.acm.org/citation.cfm?id=2330846
  • Tibshirani (1996) Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) (1996), 267–288. http://www.jstor.org/stable/2346178
  • Tibshirani et al. (2002) Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu. 2002. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences 99, 10 (May 2002), 6567–6572. DOI:http://dx.doi.org/10.1073/pnas.082099299 
  • Urbanowicz et al. (2012) Ryan J. Urbanowicz, Jeff Kiralis, Nicholas A. Sinnott-Armstrong, Tamra Heberling, Jonathan M. Fisher, and Jason H. Moore. 2012. GAMETES: a fast, direct algorithm for generating pure, strict, epistatic models with random architectures. BioData mining 5, 1 (2012), 1. https://biodatamining.biomedcentral.com/articles/10.1186/1756-0381-5-16
  • Žegklitz and Pošík (2017) Jan Žegklitz and Petr Pošík. 2017. Symbolic Regression Algorithms with Built-in Linear Regression. arXiv:1701.03641 [cs] (Jan. 2017). http://arxiv.org/abs/1701.03641