When traditional genetic programming (GP) is applied to classification or regression, individual programs assume the roles of feature selection, transformation, and model prediction, and are evaluated on the accuracy of their predictions. The flexibility of evolving both the structure and the parameters of a model comes with a heavy computational cost, which can be mitigated by instead using a fast (e.g. polynomial-time) machine learning (ML) method to optimize the parameters of a GP model with respect to an objective function (for example, least-squares error minimization with linear regression). With this in mind, many variants of GP have been proposed that embed linear regression and/or local search in each program, leading to better models (Iba and Sato, 1994; Kommenda et al., 2013; Arnaldo et al., 2014; La Cava et al., 2015). The high-level takeaway from the success of these hybrid methods is that the computational effort of GP is best focused on the parts of the modeling process that are known to be NP-hard, namely feature selection (Foster et al., 2015) and feature construction (Krawiec, 2002).
The task of feature construction, also known as feature engineering or representation learning, is well-motivated since the central factor affecting the quality of a model derived from ML is the ability of the data representation to facilitate learning (Bengio et al., 2013). This paper focuses on the supervised classification task, for which the goal is to find a mapping
ŷ(x): ℝ^d → {1, …, K} that associates the vector of attributes x ∈ ℝ^d with class labels from the set {1, …, K} using N paired examples {(x_i, y_i), i = 1, …, N}. The goal of feature engineering is to find a new representation of x via a P-dimensional feature mapping Φ(x): ℝ^d → ℝ^P, such that a classifier ŷ(Φ(x)) more accurately classifies samples than ŷ(x).
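To make the role of Φ concrete, the following sketch (illustrative data and feature choice, not from the paper) shows a classifier ŷ(Φ(x)) outperforming ŷ(x) on an XOR-style problem, where no linear model on the raw attributes can succeed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# XOR-style data: no linear model on the raw attributes x can separate it
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 25, dtype=float)
y = np.array([0, 1, 1, 0] * 25)

raw_acc = LogisticRegression().fit(X, y).score(X, y)      # near chance level

# one engineered feature phi(x) = (x1 - x2)^2 makes the classes separable
Phi = ((X[:, 0] - X[:, 1]) ** 2).reshape(-1, 1)
eng_acc = LogisticRegression().fit(Phi, y).score(Phi, y)  # perfect separation
```

Here the engineered feature equals the class label exactly, so the simple classifier only needs to threshold it; this is the kind of representation FEW aims to discover automatically.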
GP-based approaches to representation learning include evolving single features for decision trees (DT) (Muharram and Smith, 2005) and coupling an ML model with each program (Krawiec, 2002; Silva et al., 2015; Žegklitz and Pošík, 2017). Recent work (De Melo, 2014; Arnaldo et al., 2015) has advocated what we refer to as an “ensemble” approach, which treats the entire GP population as Φ, with each program representing a transformation of the form φ(x): ℝ^d → ℝ. These methods feed the population output into a linear regression model to make predictions.
The ML-specific nature of these previous approaches motivates our development of the more general feature engineering wrapper (FEW) method (available from https://lacava.github.io/few and via the Python Package Index: https://pypi.python.org/pypi/FEW), which is a wrapper-based ensemble method of feature engineering with GP (La Cava and Moore, 2017). Unlike previous approaches, FEW allows any learning algorithm in scikit-learn format (Pedregosa et al., 2011) to be used for estimation. FEW has been demonstrated for regression with several ML pairings, including Lasso (Tibshirani, 1996), linear and nonlinear support vector regression, DT, and k-nearest neighbors (KNN). Central to its ability to evolve features in a single population is the introduction of ε-lexicase survival, which produces uncorrelated population behavior.
The wrapper-based ensemble approach to GP is under-studied and presents new challenges from an evolutionary computation standpoint, namely the need for individuals in the population to complement each other in facilitating the learning of the ML method with which they are paired. Our goal in this paper is to use FEW as a test bed for evaluating the ability of several survival and fitness techniques in this new framework for supervised classification. In addition, whereas previously FEW was demonstrated in side-by-side comparisons with default ML methods, here we more robustly analyze whether FEW can, in general, produce better models than existing ML techniques when hyper-parameter optimization of every method is considered.
This paper contains four main contributions. First, it presents a much-needed analysis of fitness and survival methods for ensemble-based representation learning with GP. Second, it focuses on the classification task, which has not been the focus of previous methods in this GP framework. Third, it presents robust comparisons of FEW to other ML methods, including a previously proposed GP method that also focuses on feature learning. As a final contribution, we analyze a biomedical problem for which FEW correctly identifies the nonlinear, underlying structure of the data across ML pairings, thereby showing the usefulness of learning readable data representations.
We pair FEW with several well-known classifiers in our analysis: logistic regression (LR), support vector classification (SVC), KNN, DT, and random forests (RF). We present an overview of FEW in Section 2, including a description of the fitness and survival methods that are tested. We review related work more thoroughly in Section 3, distinguishing between wrapper and filter approaches as well as single, multiple, and ensemble representations of features in GP. The results of the experiments on FEW and its comparison to other methods are presented in Section 5, with discussion and conclusions following in Section 6.
The components of FEW are summarized in Figure 1. The learning process begins by fitting the ML method to the original data. FEW maintains an internal validation set to evaluate new models, which guarantees that the returned model will have a cross-validation (CV) fitness at least as good as the initial data representation can produce. FEW then initializes a population of feature transformations, Φ, seeded with the features from the initial ML model that have non-zero coefficients. Each generation, a new ML model is trained on Φ(x) to produce ŷ(Φ(x)).
The selection step of FEW is the entry point for new information from the ML method about the quality of the current representation. Methods that admit regularization (available in the scikit-learn implementations of LR and SVC) or feature importance scores (DT and RF) apply selective pressure to the GP population by eliminating any individuals with a corresponding coefficient or feature importance of zero in the ML model. Feature importance for DT and RF is measured using the Gini importance (Breiman and Cutler, 2003). Thus ML and GP share the feature selection role. After selection, the remaining individuals are used to produce offspring via sub-tree crossover and point mutation. In this way FEW differs from previous ensemble representation learning approaches (Arnaldo et al., 2015; McConaghy, 2011), which rely on mutation alone for variation.
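The coefficient- and importance-based selection described above can be sketched with scikit-learn as follows; the data and settings are illustrative, not FEW's actual implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
Phi = rng.normal(size=(120, 6))                   # stand-in engineered features
y = (Phi[:, 0] + 2 * Phi[:, 3] > 0).astype(int)   # only features 0 and 3 matter

# L1-regularized LR drives coefficients of unhelpful features to exactly zero,
# so the corresponding individuals would be eliminated from the GP population
lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(Phi, y)
keep_lr = np.flatnonzero(lr.coef_[0] != 0)

# RF exposes Gini importances instead; individuals with zero importance go
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(Phi, y)
keep_rf = np.flatnonzero(rf.feature_importances_ > 0)
```

In both cases the ML model itself reports which engineered features it is actually using, which is the information FEW feeds back into selection.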
The fitness step (see Section 2.1) evaluates the ability of the parent and offspring features to adequately distinguish between classes in the training data. The survival step in FEW (see Section 2.2) reduces the pool of parents and offspring back to the original population size, and the surviving set of transformations, Φ, is used at the beginning of the next generation to fit a new ML model.
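Putting the pieces together, the generational loop can be caricatured in a few lines. This is a deliberately minimal sketch with made-up programs (attribute index pairs) and truncation survival standing in for the survival methods of Section 2.2; none of the names correspond to FEW's actual code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# toy problem with a hidden interaction between attributes 0 and 1
X = rng.integers(0, 2, size=(300, 8)).astype(float)
y = (X[:, 0] != X[:, 1]).astype(int)

def evaluate(prog, X):
    # a "program" here is just an index pair (i, j); phi(x) = (x_i - x_j)^2
    i, j = prog
    return (X[:, i] - X[:, j]) ** 2

def fitness(prog):
    # squared Pearson correlation of the feature output with the labels
    c = np.corrcoef(evaluate(prog, X), y)[0, 1]
    return 0.0 if np.isnan(c) else c ** 2

def mutate(prog):
    # point mutation: replace one attribute index at random
    q = list(prog)
    q[int(rng.integers(2))] = int(rng.integers(8))
    return tuple(q)

P = 8  # population size
pop = [tuple(rng.integers(8, size=2)) for _ in range(P)]
for gen in range(30):
    offspring = [mutate(p) for p in pop]       # variation
    combined = pop + offspring                 # parents + offspring
    combined.sort(key=fitness, reverse=True)   # fitness evaluation
    pop = combined[:P]   # survival (truncation here; FEW uses eps-lexicase etc.)

Phi = np.column_stack([evaluate(p, X) for p in pop])
model = LogisticRegression().fit(Phi, y)       # fit a new ML model on Phi
acc = model.score(Phi, y)
```

The point of the sketch is the division of labor: GP proposes and varies the representation Φ, while the ML method does the actual classification on Φ(x) each generation.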
We compare three fitness metrics (Eqns. 1–3 below) in our experimental analysis in Section 4.1. In contrast to traditional GP, the fitness of an engineered feature must measure the feature's ability to separate data between classes rather than its predictive capacity, since φ(x) is not itself a model. A simple approach to assessing feature quality is the coefficient of determination, R², of the feature with respect to the class labels (Eqn. 1).
For binary classification, R² seems appropriate, since it only has to capture the correlation of the feature with a change from 0 to 1. For multiclass classification, however, R² imposes an additional constraint on the feature by rewarding it for increasing in the direction of the class label values. For certain problems (e.g. one in which the ordering of the class labels corresponds to a degree of risk), this fitness pressure may be warranted, but in the general case we do not want to assume that the order of the class labels, or the relative distance between them in a feature, is meaningful. Instead, we want to reward features that separate samples from different classes and cluster samples within classes.
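A minimal sketch of the R² fitness (taken here as the squared Pearson correlation of the feature with the labels, an assumption about Eqn. 1's exact form) illustrates the label-ordering issue:

```python
import numpy as np

def r2_fitness(phi, y):
    # squared Pearson correlation between a feature and the class labels
    c = np.corrcoef(phi, y)[0, 1]
    return 0.0 if np.isnan(c) else c ** 2

y = np.array([0] * 10 + [1] * 10 + [2] * 10)

# a feature that increases with the class label scores highly ...
monotone = y + 0.01 * np.random.default_rng(0).normal(size=30)
# ... but a feature that separates the same classes in a different order
# scores poorly, even though a classifier could use it just as well
reordered = np.where(y == 0, 0.0, np.where(y == 1, 2.0, 1.0))
```

The `reordered` feature separates the three classes perfectly yet is penalized purely because its values are not monotone in the label, which is the motivation for the cluster-based metrics below.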
where μ_k is the mean of φ(x) among samples belonging to class k, and σ_k is the corresponding standard deviation. The Fisher criterion gives a measure of the average pairwise separation between, and dispersion within, classes for φ. However, it does not provide fine-grained information about the distances of specific samples in the transformation. In an attempt to extract this information, we include the silhouette score (Rousseeuw, 1987) in our comparisons. Like Eqn. 2, the silhouette score assesses feature quality by combining the within-class variance with the distance between neighboring classes, thereby capturing both the tightness of a cluster and its overlap with the nearest cluster. The silhouette score s_i for a single sample is defined as s_i = (b_i − a_i) / max(a_i, b_i) (Eqn. 3), where a_i is the average distance of sample i to the other samples in its own class, and b_i is its average distance to the samples in the next nearest class (according to centroid distance). Thus Eqn. 3 takes into account both the pairwise distances within a class and the separation of neighboring classes; the Euclidean distance metric is used. For the aggregate fitness of an engineered feature, the average silhouette score over all samples is used.
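The two class-separation metrics can be sketched as follows. The Fisher function below uses one common form of the criterion, which may differ in detail from Eqn. 2; the silhouette computation uses scikit-learn directly:

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import silhouette_score

def fisher_fitness(phi, y):
    """Average pairwise Fisher criterion: between-class separation over
    within-class spread (one common form; Eqn. 2 may differ in detail)."""
    scores = []
    for a, b in combinations(np.unique(y), 2):
        pa, pb = phi[y == a], phi[y == b]
        spread = np.sqrt(pa.var() + pb.var())
        scores.append(abs(pa.mean() - pb.mean()) / max(spread, 1e-12))
    return float(np.mean(scores))

def silhouette_fitness(phi, y):
    # mean silhouette of the 1-d feature under the Euclidean metric (Eqn. 3)
    return silhouette_score(phi.reshape(-1, 1), y, metric="euclidean")

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 50)
# tight, well-separated class clusters vs. heavily overlapping ones
tight = np.concatenate([rng.normal(m, 0.1, 50) for m in (0.0, 2.0, 4.0)])
loose = np.concatenate([rng.normal(m, 2.0, 50) for m in (0.0, 2.0, 4.0)])
```

Both metrics reward the tightly clustered feature over the overlapping one; the silhouette additionally scores each sample individually, which is the fine-grained information discussed above.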
Unlike typical populations in model-based GP, the surviving individuals in FEW are assessed together in a single ML estimation, and therefore benefit from being chosen to work well together. In fact, many ML pairings, including LR and SVC, depend on low collinearity between features. We test four methods for achieving this cooperation: tournament survival (tournaments of size 2), deterministic crowding, ε-lexicase survival, and random survival. Tournament survival is agnostic to the population structure when selecting survivors, simply picking the individual in the tournament with the best fitness to survive. Deterministic crowding and ε-lexicase survival, by contrast, are designed to promote feature diversity, which should influence the ability of the population to produce an effective representation for the ML training step. We include random survival as a control for the effect of unguided search.
Deterministic crowding (Mahfoud, 1995) is a niching mechanism in which offspring compete only with the parent they are most similar to. We define similarity as the correlation (R², Eqn. 1) between a child and its parents. In the case of mutation, there is only one parent, so no similarity comparison is necessary. Although deterministic crowding is traditionally a steady-state algorithm, its implementation here is generational. Children take the place of their parent in the surviving population if and only if they have a better fitness. This algorithm produces niches in the population that should maintain diverse features.
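A generational sketch of the crowding replacement rule, representing each individual directly by its output vector and assuming higher fitness is better (illustrative names, not the paper's implementation):

```python
import numpy as np

def r2(a, b):
    # squared Pearson correlation between two output vectors (similarity)
    c = np.corrcoef(a, b)[0, 1]
    return 0.0 if np.isnan(c) else c ** 2

def crowding_pair(p1, p2, c1, c2, fitness):
    """Pair each crossover child with the more similar parent (by R^2 of
    their outputs) and keep the fitter individual of each pair."""
    if r2(c1, p1) + r2(c2, p2) >= r2(c1, p2) + r2(c2, p1):
        pairs = [(p1, c1), (p2, c2)]
    else:
        pairs = [(p1, c2), (p2, c1)]
    return [c if fitness(c) > fitness(p) else p for p, c in pairs]

# toy usage: each child resembles one parent and improves on a toy fitness
p1 = np.array([0.0, 1.0, 2.0, 3.0])
p2 = np.array([1.0, 3.0, 0.0, 2.0])
c1 = np.array([0.0, 1.0, 2.0, 5.0])   # similar to p1
c2 = np.array([1.0, 3.0, 0.0, 4.0])   # similar to p2
survivors = crowding_pair(p1, p2, c1, c2, lambda v: v.sum())
```

Because each child can only displace the parent occupying its own niche, dissimilar parents are never pushed out by an unrelated child, which is how the mechanism preserves diverse features.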
ε-lexicase survival is a new survival technique adapted from ε-lexicase selection (La Cava et al., 2016) for use in FEW. ε-lexicase selection is, in turn, an adaptation of lexicase selection (Spector, 2012; Helmuth et al., 2014) for continuous-valued problems. Lexicase selection works by pressuring individuals in the population to solve unique subsets of the training samples (i.e. cases), shifting selective pressure to the cases that are most difficult in terms of population performance. ε-lexicase survival differs from ε-lexicase selection in that it removes the individuals selected at each step from the remaining selection pool and adds them to the survivors for the next generation. Each iteration of ε-lexicase survival proceeds as follows:
for each parent selection:
    ε(t) ← MAD of fitnesses on case t, for each case t      # get ε for each case
    pool ← population not yet selected
    cases ← list of training cases
    while |cases| > 0 and |pool| > 1:                       # main loop
        case ← random choice from cases                     # pick a case
        elite ← best fitness in pool on case                # determine elite
        pool ← {i ∈ pool : fitness(i, case) within ε(case) of elite}   # reduce pool
        remove case from cases
    survivor ← random choice from pool                      # pick survivor
In the routine above, ε(t) is the median absolute deviation (MAD) of the fitnesses on case t across the population.
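The routine can be implemented compactly over a per-case fitness matrix. This sketch assumes higher per-case fitness is better and is not the reference implementation; all names are illustrative:

```python
import numpy as np

def eps_lexicase_survival(F, n_survivors, rng=None):
    """Epsilon-lexicase survival over a fitness matrix F of shape
    (n_individuals, n_cases), where HIGHER is better. Selected individuals
    are removed from the pool available to later selections."""
    rng = np.random.default_rng() if rng is None else rng
    F = np.asarray(F, dtype=float)
    # epsilon for each case: median absolute deviation across the population
    med = np.median(F, axis=0)
    eps = np.median(np.abs(F - med), axis=0)
    available = list(range(F.shape[0]))
    survivors = []
    for _ in range(n_survivors):
        pool = list(available)                      # individuals not yet chosen
        cases = list(rng.permutation(F.shape[1]))   # shuffled case order
        while cases and len(pool) > 1:              # main loop
            t = cases.pop()                         # pick a case
            elite = max(F[i, t] for i in pool)      # determine elite
            pool = [i for i in pool if F[i, t] >= elite - eps[t]]  # reduce pool
        pick = pool[int(rng.integers(len(pool)))]   # pick survivor at random
        survivors.append(pick)
        available.remove(pick)                      # survival removes from pool
    return survivors
```

Removing each pick from `available` is the key difference from ε-lexicase selection: one individual cannot fill multiple survivor slots, so the surviving set is forced to cover different subsets of cases.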
3. Related Work
Feature construction has received considerable attention in GP, with implementations falling into single feature, multiple feature, and ensemble categories. Single feature representations attempt to evolve a single solution that is an engineered feature, as in (Muharram and Smith, 2005; Guo and Nandi, 2006). Multiple feature representations encode a candidate set of feature transformations in each individual (Krawiec, 2002; Smith and Bull, 2005; Silva et al., 2015; La Cava, William et al., 2017), such that each individual is a multi-output estimate of Φ(x). In this case, a separate ML model is trained on the outputs of each program, and the resulting output is used to assign fitness to each individual. Ensembles are a more recent approach (McConaghy, 2011; De Melo, 2014; Arnaldo et al., 2015; La Cava and Moore, 2017) designed to reduce the computational complexity of fitting a model to each individual: a single ML model is fit to the output of the entire population. This ensemble-like approach treats each individual in the population as a single feature φ(x), and the combined output of the population as Φ(x). Among these ensemble methods, FEW shares the most in common with evolutionary feature synthesis (EFS) (Arnaldo et al., 2015) in that it uses the more successful wrapper-based approach (Krawiec, 2002; Smith and Bull, 2005) and incorporates feature selection information from the ML routine. Unlike FEW, EFS pairs exclusively with Lasso (Tibshirani, 1996), uses three population partitions, and does not incorporate crossover between individuals. FEW is motivated by the hypotheses that 1) the ML pairing is best treated like a hyper-parameter of the method, and 2) existing diversity-preserving selection methods can be successfully adapted to the purposes of ensemble-based feature survival.
As a final note, previous work does not often consider the effect of tuning the proposed algorithm or the ML approaches to which it is compared, which is a vital step in algorithm comparisons (Caruana and Niculescu-Mizil, 2006) and in the application of ML to real-world problems.
4. Experimental Setup
We conduct two separate sets of experiments. The first set, described in Section 4.1, is designed to compare the fitness and survival methods for FEW in combination with different ML methods and hyper-parameters. We use the results of the first experiment to choose the fitness and survival method for FEW in the second set of experiments. The second set of experiments, described in Section 4.2, is a benchmark comparison of FEW to several ML methods on a larger set of classification problems. All the datasets used in the comparison are freely available via the Penn Machine Learning Benchmark repository (https://github.com/EpistasisLab/penn-ml-benchmarks).
4.1. FEW comparisons
The settings compared for FEW are:

Population size: 10, 50, 100
Survival: tournament, deterministic crowding, ε-lexicase
ML: LR, DT, KNN
4.2. Comparison to other methods
We evaluate FEW’s performance in comparison to six other ML approaches: Gaussian naïve Bayes (NB), LR, KNN, SVC, RF, and M4GP (La Cava, William et al., 2017), a multi-feature GP method derived from (Silva et al., 2015) that couples a multi-feature representation with a nearest centroid classifier (Tibshirani et al., 2002). For more information on the implementations of NB, LR, KNN, SVC, and RF, refer to (Pedregosa et al., 2011). These methods are evaluated on 20 classification problems that vary in their numbers of classes, samples, and features, as seen in Table 2. To ensure robust comparisons, we include hyper-parameter optimization in the training phase of each method: we perform a grid search over the hyper-parameters of each method (shown in Table 2), using 5-fold cross-validation on the training set to choose the final parameters. The model with the best average cross-validation accuracy on the training set is evaluated on the test set. This process is repeated for 30 shuffled, 50/50 train/test splits of the data. To control for the different numbers of possible hyper-parameter combinations between methods, we limit each grid search to a maximum of 100 combinations of hyper-parameter settings during training.
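This evaluation protocol maps directly onto scikit-learn's grid search utilities. The sketch below uses an illustrative dataset with the KNN grid, showing the 100-combination cap and 5-fold CV model selection on one train/test split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
# one of the 30 shuffled 50/50 train/test splits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# KNN grid: K in 1..50, two weighting schemes -> exactly 100 combinations
param_grid = {"n_neighbors": list(range(1, 51)), "weights": ["uniform", "distance"]}
assert len(ParameterGrid(param_grid)) <= 100   # cap on combinations

# 5-fold CV on the training set picks the parameters; the best model is
# refit on the full training set and then scored on the held-out test set
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5).fit(X_tr, y_tr)
test_acc = search.score(X_te, y_te)
```

Repeating this over 30 random splits per dataset and method yields the accuracy distributions compared in Section 5.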
The hyper-parameters considered for FEW (see Table 2) include the population size (expressed as a function of the number of features in the data), the ML method, the output type of the features (float or bool), and the maximum feature depth. Floating-point outputs use an arithmetic operator set, and boolean outputs add logical and comparison operators (e.g. AND, OR, XOR). It is important to note that tuning of the ML method is not considered when it is paired with FEW. As a result, this experiment compares the relative effect of learning a representation for a default ML method to that of tuning the hyper-parameters of those methods.
FEW: population size (0.25, …, 3 × number of features); ML (LR, KNN, RF, SVM); output type (bool, float); max depth (2, 3)
M4GP: population size (250, 500, 1000); generations (50, 100, 500, 1000); selection method (tournament, lexicase); max length (10, 25, 50, 100)
Gaussian Naïve Bayes: none
Logistic Regression: regularization coefficient (0.001, …, 100); penalty (ℓ1, ℓ2, elastic net); epochs (5, 10)
Support Vector Classifier: regularization coefficient (0.01, …, 100, ‘auto’); γ (0.01, 10, 1000, ‘auto’); kernel (linear, sigmoid, radial basis function)
Random Forest Classifier: no. of estimators (10, 100, 1000); minimum weight fraction per leaf (0.0, 0.25, 0.5); max features (sqrt, log2, None); splitting criterion (entropy, gini)
K-Nearest Neighbor Classifier: K (1, …, 50); weights (uniform, distance)
Dataset (classes, samples, features):
Hill Valley with noise: 2, 1212, 100
Hill Valley without noise: 2, 1212, 100
molecular biology promoters: 2, 106, 58
5. Results

The fitness and survival methods are compared on the tuning datasets in Figures 2 and 3, respectively. The fitness metric comparisons yield unexpected results. The Fisher criterion is outperformed by both R² and the silhouette score on 3 out of 4 problems (p ≤ 4.8e-7). Surprisingly, the silhouette score does not outperform R² as a fitness metric either; across problems and ML pairings, there is no significant difference in performance aside from new-thyroid. This is surprising given our hypothesis in Section 2.1 that the class label assumptions implicit in R² would make it less suited to classification with multiple labels. Given this evidence, in conjunction with the lower computational complexity of R², we opt to use R² as the fitness criterion for the benchmark comparison.
We find that ε-lexicase survival produces more accurate classifiers than deterministic crowding, tournament, and random survival across problems and ML pairings. It is significantly correlated with higher test accuracy according to a t-test (p < 2e-16) and significantly outperforms tournament (p ≤ 0.002) and deterministic crowding (p ≤ 2.4e-7) according to all pairwise Wilcoxon tests, correcting for multiple comparisons. ε-lexicase survival also outperforms random survival on auto (p ≤ 4.4e-8) and new-thyroid (p < 2e-16), and ties it on the other two problems (for calendarDOW, p = 0.094). Random survival performs strongly compared to tournament and deterministic crowding, outperforming those methods on 3 out of 4 problems. These results motivate our use of ε-lexicase survival in the benchmark comparison.
The test set accuracies of the seven methods on the benchmark datasets are shown in boxplot form in Figure 4, and the mean rankings are summarized in Figure 5. Performance varies across problems, with RF, SVC, M4GP, or FEW generally producing the highest test accuracy. Whereas FEW generally does well on the problems for which M4GP excels, FEW also does well in cases where M4GP underperforms, which is likely due to FEW’s ability to tune the ML method with which it is paired. Three problems stand out as particularly amenable to feature engineering: GMT 2w-20a-0.4h, Hill_Valley_without_noise, and parity5+5. These three problems are well known for containing strong interactions between features, which helps explain the observed increase in performance from FEW. In terms of mean rankings across problems, FEW generates the best classifiers among the methods tested, followed closely by SVC and RF. A Friedman test of the rankings with post-hoc analysis reveals that RF, SVC, and FEW significantly outperform NB and LR across all problems (p ≤ 0.039).
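The ranking analysis can be reproduced in outline with SciPy; the accuracy values below are synthetic stand-ins, not the paper's results:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# hypothetical accuracy matrix: rows = problems, columns = methods
rng = np.random.default_rng(0)
acc = np.column_stack([
    rng.normal(0.85, 0.05, 20),   # e.g. a strong method
    rng.normal(0.84, 0.05, 20),   # e.g. a comparable method
    rng.normal(0.70, 0.05, 20),   # e.g. a weak baseline
])

# rank of each method within each problem (1 = best accuracy)
ranks = (-acc).argsort(axis=1).argsort(axis=1) + 1

# Friedman test: do the methods' rankings differ across problems?
stat, p = friedmanchisquare(acc[:, 0], acc[:, 1], acc[:, 2])
```

A significant Friedman test (small p) licenses the post-hoc pairwise comparisons used to separate the top methods from the baselines.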
As expected, the computation time of FEW is higher than that of the other ML methods (see Figure 6) due to its wrapper-based approach. The shorter runtime of M4GP may be explained by its C++ implementation (compared to FEW’s Python implementation) as well as its use of a consistently fast ML pairing.
We show models generated by single runs of FEW on GMT 2w-20a-0.4h in Table 4 using DT and LR. This genetics problem is generated using the GAMETES simulation tool (Urbanowicz et al., 2012). It consists of 20 attributes, 18 of which are noise and two of which interact epistatically, meaning they must be considered together to infer the correct class (the labels also contain noise). The models correctly identify the interaction between features 18 and 19. For this problem, FEW’s transformation provides the essential knowledge required for a solution, whereas the ML methods simply serve as discriminant functions processing the information presented via the transformation.
Decision Tree Model
Logistic Regression Model

Performance                Decision Tree    Logistic Regression
Initial ML CV accuracy     0.487            0.473
Final model CV accuracy    0.763            0.803
6. Discussion & Conclusion
Our results suggest that FEW is a useful technique for supervised classification. FEW performs best on average among the algorithms tested, which include optimized SVM, RF, KNN, M4GP, LR, and NB models. This result provides evidence that, for these ML methods, the data representation can influence algorithm performance as much as, if not more than, the parameter settings of those algorithms. Although it has not been tested here, including hyper-parameter optimization of the ML methods paired with FEW in the tuning step would likely yield even greater gains over the baseline approach. FEW also performs better than a multiple-feature GP approach (M4GP) that uses a fixed ML pairing.
Despite FEW’s runtime in these tests, a complexity analysis suggests it is well positioned for large datasets in comparison to other feature construction techniques. Whereas techniques like polynomial feature expansion scale poorly with the number of features d (O(d^n) for an n-degree polynomial) and techniques like kernel transformations scale poorly with the number of samples N (O(N²)) (Friedman et al., 2001), FEW scales independently of the number of features in the dataset, linearly with N, and quadratically with the population size. These observations warrant further investigation with large datasets.
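The combinatorial growth of polynomial expansion can be checked directly; the counts below follow from the C(d + n, n) term-count formula, which scikit-learn's `PolynomialFeatures` (with its bias term) reproduces:

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

# a full degree-n polynomial expansion of d features has C(d + n, n) terms
d, n = 100, 3
n_terms = comb(d + n, n)   # 176851 constructed features for d=100, n=3

# PolynomialFeatures (with bias) matches the formula on a small example
X = np.ones((5, 4))
expanded = PolynomialFeatures(degree=3).fit_transform(X)
assert expanded.shape[1] == comb(4 + 3, 3)   # 35 terms

# FEW instead evaluates a fixed-size population of P features per generation,
# so the representation size is set by P, independent of d
```

At d = 100 the degree-3 expansion already constructs over 170,000 features, whereas FEW's representation size is bounded by the chosen population size regardless of d.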
Acknowledgments

This work was supported by the Warren Center for Network and Data Science at the University of Pennsylvania, as well as NIH grants P30-ES013508, AI116794 and LM009012.
References

- Ahmed et al. (2014) Soha Ahmed, Mengjie Zhang, Lifeng Peng, and Bing Xue. 2014. Multiple feature construction for effective biomarker identification and classification using genetic programming. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation. ACM, 249–256. http://dl.acm.org/citation.cfm?id=2598292
- Arnaldo et al. (2014) Ignacio Arnaldo, Krzysztof Krawiec, and Una-May O’Reilly. 2014. Multiple regression genetic programming. In Proceedings of the 2014 conference on Genetic and evolutionary computation. ACM Press, 879–886. DOI:http://dx.doi.org/10.1145/2576768.2598291
- Arnaldo et al. (2015) Ignacio Arnaldo, Una-May O’Reilly, and Kalyan Veeramachaneni. 2015. Building Predictive Models via Feature Synthesis. ACM Press, 983–990. DOI:http://dx.doi.org/10.1145/2739480.2754693
- Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35, 8 (2013), 1798–1828. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6472238
- Breiman and Cutler (2003) Leo Breiman and Adele Cutler. 2003. Random Forests. (2003). http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
- Caruana and Niculescu-Mizil (2006) Rich Caruana and Alexandru Niculescu-Mizil. 2006. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd international conference on Machine learning. ACM, 161–168. http://dl.acm.org/citation.cfm?id=1143865
- De Melo (2014) Vinícius Veloso De Melo. 2014. Kaizen programming. In GECCO ’14: Proceedings of the Genetic and Evolutionary Computation Conference. ACM Press, 895–902. DOI:http://dx.doi.org/10.1145/2576768.2598264
- Foster et al. (2015) Dean Foster, Howard Karloff, and Justin Thaler. 2015. Variable selection is hard. In Proceedings of The 28th Conference on Learning Theory. 696–709. http://www.jmlr.org/proceedings/papers/v40/Foster15.pdf
- Friedman et al. (2001) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2001. The elements of statistical learning. Vol. 1. Springer series in statistics Springer, Berlin. http://statweb.stanford.edu/~tibs/book/preface.ps
- Guo and Nandi (2006) Hong Guo and Asoke K. Nandi. 2006. Breast cancer diagnosis using genetic programming generated feature. Pattern Recognition 39, 5 (May 2006), 980–987. DOI:http://dx.doi.org/10.1016/j.patcog.2005.10.001
- Helmuth et al. (2014) T. Helmuth, L. Spector, and J. Matheson. 2014. Solving Uncompromising Problems with Lexicase Selection. IEEE Transactions on Evolutionary Computation PP, 99 (2014), 1–1. DOI:http://dx.doi.org/10.1109/TEVC.2014.2362729
- Iba and Sato (1994) Hitoshi Iba and Taisuke Sato. 1994. Genetic Programming with Local Hill-Climbing. Technical Report ETL-TR-94-4. Electrotechnical Laboratory, 1-1-4 Umezono, Tsukuba-city, Ibaraki, 305, Japan. http://www.cs.ucl.ac.uk/staff/W.Langdon/ftp/papers/Iba_1994_GPlHC.pdf
- Kommenda et al. (2013) Michael Kommenda, Gabriel Kronberger, Stephan Winkler, Michael Affenzeller, and Stefan Wagner. 2013. Effects of constant optimization by nonlinear least squares minimization in symbolic regression. In GECCO ’13 Companion: Proceeding of the fifteenth annual conference companion on Genetic and evolutionary computation conference companion. ACM, Amsterdam, The Netherlands, 1121–1128. DOI:http://dx.doi.org/doi:10.1145/2464576.2482691
- Krawiec (2002) Krzysztof Krawiec. 2002. Genetic programming-based construction of features for machine learning and knowledge discovery tasks. Genetic Programming and Evolvable Machines 3, 4 (2002), 329–343. http://link.springer.com/article/10.1023/A:1020984725014
- La Cava et al. (2015) William La Cava, Thomas Helmuth, Lee Spector, and Kourosh Danai. 2015. Genetic Programming with Epigenetic Local Search. In GECCO ’15: Proceedings of the Genetic and Evolutionary Computation Conference. ACM Press, 1055–1062. DOI:http://dx.doi.org/10.1145/2739480.2754763
- La Cava and Moore (2017) William La Cava and Jason Moore. 2017. A General Feature Engineering Wrapper for Machine Learning Using ε-Lexicase Survival. In European Conference on Genetic Programming. Springer, 80–95. https://link.springer.com/chapter/10.1007/978-3-319-55696-3_6 DOI: 10.1007/978-3-319-55696-3_6.
- La Cava et al. (2016) William La Cava, Lee Spector, and Kourosh Danai. 2016. Epsilon-Lexicase Selection for Regression. In GECCO ’16: Proceedings of the Genetic and Evolutionary Computation Conference. ACM, New York, NY, USA, 741–748. DOI:http://dx.doi.org/10.1145/2908812.2908898
- La Cava, William et al. (2017) La Cava, William, Silva, Sara, Vanneschi, Leonardo, Spector, Lee, and Moore, Jason H. 2017. Genetic Programming Representations for Multi-dimensional Feature Learning in Biomedical Classification. In European Conference on the Applications of Evolutionary Computation. Springer, 158-173. https://link.springer.com/chapter/10.1007/978-3-319-55849-3_11 DOI: 10.1007/978-3-319-55849-3_11.
- Mahfoud (1995) Samir W. Mahfoud. 1995. Niching methods for genetic algorithms. Ph.D. Dissertation.
- McConaghy (2011) Trent McConaghy. 2011. FFX: Fast, scalable, deterministic symbolic regression technology. In Genetic Programming Theory and Practice IX. Springer, 235–260. http://link.springer.com/chapter/10.1007/978-1-4614-1770-5_13
- Muharram and Smith (2005) Mohammed Muharram and George D. Smith. 2005. Evolutionary constructive induction. IEEE Transactions on Knowledge and Data Engineering 17, 11 (2005), 1518–1528. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1512037
- Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, and others. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, Oct (2011), 2825–2830. http://www.jmlr.org/papers/v12/pedregosa11a.html
- Rousseeuw (1987) Peter J. Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20 (Nov. 1987), 53–65. DOI:http://dx.doi.org/10.1016/0377-0427(87)90125-7
- Silva et al. (2015) Sara Silva, Luis Muñoz, Leonardo Trujillo, Vijay Ingalalli, Mauro Castelli, and Leonardo Vanneschi. 2015. Multiclass Classification Through Multidimensional Clustering. In Genetic Programming Theory and Practice XIII. Vol. 13. Springer, Ann Arbor, MI.
- Smith and Bull (2005) Matthew G. Smith and Larry Bull. 2005. Genetic programming with a genetic algorithm for feature construction and selection. Genetic Programming and Evolvable Machines 6, 3 (2005), 265–281. http://link.springer.com/article/10.1007/s10710-005-2988-7
- Spector (2012) Lee Spector. 2012. Assessment of problem modality by differential performance of lexicase selection in genetic programming: a preliminary report. In Proceedings of the fourteenth international conference on Genetic and evolutionary computation conference companion. 401–408. http://dl.acm.org/citation.cfm?id=2330846
- Tibshirani (1996) Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) (1996), 267–288. http://www.jstor.org/stable/2346178
- Tibshirani et al. (2002) Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu. 2002. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences 99, 10 (May 2002), 6567–6572. DOI:http://dx.doi.org/10.1073/pnas.082099299
- Urbanowicz et al. (2012) Ryan J. Urbanowicz, Jeff Kiralis, Nicholas A. Sinnott-Armstrong, Tamra Heberling, Jonathan M. Fisher, and Jason H. Moore. 2012. GAMETES: a fast, direct algorithm for generating pure, strict, epistatic models with random architectures. BioData mining 5, 1 (2012), 1. https://biodatamining.biomedcentral.com/articles/10.1186/1756-0381-5-16
- Žegklitz and Pošík (2017) Jan Žegklitz and Petr Pošík. 2017. Symbolic Regression Algorithms with Built-in Linear Regression. arXiv:1701.03641 [cs] (Jan. 2017). http://arxiv.org/abs/1701.03641