Machine learning (ML) algorithms for tasks such as classification, prediction or clustering, require that training samples they are fed with are described in a compact form, in order to deliver an output in a reasonable amount of time. This description is known as representation.
Representation learning is a set of methods that allows a machine to be fed with raw data, such as images pixels, and to automatically discover the representations needed for classification or other machine learning tasks . In this regard, Deep Learning has showed to be an powerful framework for representation learning.
Deep learning methods consist in different architectures of several stacked layers of artificial neural networks. These deep neural networks (DNN) are based on the idea that a slightly more compact and abstract representation is generated in each forward layer of a DNN; thus with enough of these layers, an original representation consisting in raw data can be transformed into a representation both, compact enough to be tractable by ML algorithms, and abstract enough such that classes or clusters in data can still be discriminated by ML algorithms.
GP is an evolutionary algorithm generally used for machine learning tasks. GP has also been used for representation learning with mixed results . GP has achieved competitive results against state-of-the-art methods when the raw initial representation is composed of up to a few dozens variables (known as features). When initial representations are composed of hundreds or thousands of features, such as in image or time series problems, GP approaches to representation learning need to leverage from human expert knowledge of the problem’s domain in order to achieve competitive results, an undesirable property considering a trend towards higher degrees of automation, and unlike deep learning, which one of its most important aspects is being domain agnostic.
In this work we propose a new method that allows GP to deal with these two shortcomings. Our method is inspired in the same basic idea that motifs DNN, in the sense that we propose a multilayer GP that gradually generates more compact representations. The rest of the paper is as follows. For the rest of this section we describe the basics of GP and then we present related works to our proposed approach. In Sec. 2 we state the representation learning problem and discuss some caveats in attempting to deal with it with GP; in Sec. 3 we present our approach we call Deep Genetic Programming; in Sec. 4 we present some preliminary experimental results; in Sec. 5 we present some concluding thoughts and discuss the future line work.
GP is an evolutionary algorithm whose main difference with other evolutionary algorithms such as genetic algorithms (GA) or differential evolution (DE) 
, is that individuals in GP represent complete mathematical expressions or small computer programs, unlike GA or DE whose individuals are vectors, set of parameters or strings. GP individuals are commonly described by a tree structure. Figure1 shows an example of such a structure and the expression it represents. In a GP tree, leaf nodes, called terminals or zero-argument functions, represent input variables or constant values, while inner nodes represent very simple functions, such as arithmetic operands or trigonometric functions. The set of functions from which inner nodes can take their form has to be decided beforehand, and its called the set of primitives or simply primitives.
GP individuals are evaluated by an objective function that measures their fitness in the task they are supposed to tackle. Those individuals that perform poorly are discarded from the population and replaced by new individuals generated by performing genetic operations, such as crossover or mutation, on the so-far best performing individuals. This process is repeated a given number of generations, and these transformation and selection processes move the overall population towards promising areas of the search space where individuals perform acceptably.
GP is capable of synthetizing (or discovering) mathematical models than approximate some desired, unknown function, and thus it is potentially on par with other versatile machine learning techniques such as artificial neural networks (ANN).
1.2 Related work
Our proposed method for representation learning consist in gradually reduce the dimensionality of the initial representation. We leverage on learning autoencoder-like structures through GP to this end. Autoencoders were, originally, multilayered artificial neural networks (ANN) that attempt to copy its input to its output while data traverses a bottleneck neuron layer formed at the middle of the network. The layers of the network that comprises from its input layer up to the bottleneck it is called ”encoder”, because its task is to fit the data to pass through the bottleneck layer, with the least amount of information loss possible, while the rest of the network, from the bottleneck up to the output is called ”decoder”, because its task is to decompress the data from its compressed representaton at the bottleneck layer back to its original form.
Training autoencoders can be a method useful for representation learning , and it was one of the original methods for training DNN . After training an autoencoder with the dataset we are interested in generating a new representation for, we can discard the decoder part, and keep the encoding part that generates a compact representation for the data.
We test our method on image datasets. Examples of representation learning for image processing with GP are [17, 15]; however these kind of works rely on GP individuals that process images as a whole in each primitive function; primitives might be image addition or substraction, but normally they also tend to be specialized image filters. The GP searches for a combination and workflow of these type of functions that render a new representation useful for classification or detection tasks. The availability of these highly specialized primitives to the GP are a form of human expert knowledge brought to the system by the designer. In contrast, our proposed method process the images at individual pixel level, and only simple arithmetic and generic trigonometric functions form the set of primitives.
There are other methods for representation learning through GP that are not image processing-specific. Limon et al.  developed a method for representation learning based on GP that utilized arithmetic operations along with few statistics measures. They tested their method on a wide variety of datasets, of different problem domains, and found that GP could learn representations that boosted the performance of a classificator when the input representation consisted in a few dozen features, but beyond that (one hundred or more features) other representation learning methods or the classificator alone performed better. Similarily, Lin et al.  proposed an approach to binary classification based on generating multiple layers of representations through GP. Although superficially similar to our proposed method, their approach vastly differs from ours: each new representation layer contained a single new feature compared with the previous representation layer from which was generated, the rest of the features are taken directly from such previous layer, whereas our approach aims at generating a representation conformed of completely new feautures in each layer. Their method was in fact developed to tackle problems where initial representatios consist of just a few feautures and adding new features might be beneficial, whereas our method is designed to treat large-scale problems where we actually want to reduce the dimensionality of the initial representation.
2 Problem statement
We pose the problem of representation learning as one of dimensionality reduction. From a dataset described in representation , with an arbitrary number of samples, each sample represented by feature vector , we wish to learn a new, more compact and abstract, representation , such that each sample is now represented by feature vector such that .
2.1 Straightforward GP approach
One way to attempt to generate representation through GP would be as follows: for every feature variable that composes we set GP to find a function , such that set of input variables of is the set of feature variables that compose the representation , and that .
The problem with this approach is that it poses an untractable search space for GP. Let us suppose, without loss of generality, that each GP tree that represents is built from -arity primitives, i.e. each is a perfect binary tree. Since tree could (1) require, conceivably, all original features to generate , (2) is a perfect binary tree, and (3) input features can only go in leaf nodes of GP trees, then the height of tree is, approximately at least, , and the number of internal nodes is, approximately, . For simplicity, let us assume for now on that is a power of 2, therefore the number of internal nodes of is . Now let us suppose that we will use a set of elegible primitives, then the total size of the search space the GP needs to explore is .
This is an optimistic, lower bound estimate, since we are not yet taking into account that constants are probably needed as leaf nodes as well; but then again, this estimate shows us the complexity of the problem we are dealing with.
Suppose we wish to process a set of images to convert them from a original feature space of 64x64 gray scale pixels to a vector of 32 new features. Hence, and . We are set to search for a GP individual composed of 32 trees; each tree, potentially, of height 12. Suppose we are considering the following set of low level functions . The GP needs to search for an optimal individual among, at least, distinct possible solutions.
Although evolutionary algorithms are ideally suited to explore search spaces of such exponential growth, we propose that there might exist additional steps to the standard GP that can be taken to improve its efficiency.
3 Deep Genetic Programming
The problem of representation learning defines intractable search spaces. Using a standard GP approach to tackle representation learning would be computationally expensive. Moreover, an important number of poor solutions could be generated and the GP would need to deal with them.
We propose that considering a structural layered processing of the likes of Deep Learning will allow to significantly improve GP performance to tackle representation learning while reducing the computational burden. The idea is as follows: instead of attempting to build GP trees that convert from representation to representation in a single step, we generate a series of intermediate representations that allow to gradually go from representation to representation .
We call this approach Deep Genetic Programming (DGP).
3.1 Construction of Layers in DGP
Starting form initial representation , and in order to generate intermediate representation , the set of features that compose is partitioned into small subsets , such that , . Representation is also partitioned into small subsets , such that , .
Each feature is generated by GP tree , whose leaf nodes can be feature variables taken only from subset as well as constant values; in other words, represents function , such that , and .
Each layer is built in the same fashion as , relying on the partitioned set of features in , up until . In this way, the process of dimensionality reduction is done in a gradual manner, unlike in a straightforward single-step approach.
However, the real key of to succesfully leverage on the proposed layered approach is the partitioning of the layers, as the preliminary results we present in Sec. 4 show. In the following section we will also present a simple strategy for the construction of such partitions.
4 Experimental Results
In this section we present some preliminary results of our proposed approach. We implement a 1-layer DGP and compare it against a straightforward GP. The motivation behind these experiments is to serve as a proof of concept of DGP as a well as show the implementation and relevance of its key element, the partitioning of the input representation.
The process of representation learning is performed by the GP itself. The best performing individual that outputs the GP after the last generation serves as a feature extraction engine, that can take as input a sample in its original representation and returns as output the sample in the new representation. Notice that the GP evolves such feature extraction engine and at the same time it also discovers the features it generates as output, i.e. the actual representation learned (because it is not defined beforehand). Therefore the GP attempts to, symbiotically, learn a new representation and the mathematical functions that can generate it.
However, a problem arises when we try to define a way to evaluate the learned, more compact, representation. The answer to this problem is not trivial . In this work, we tackle the problem through the use of a decoding mechanism that reverses the compact representation to its original form. The average discrepancy between some given dataset in its original form and its the reconstructed version (returned by the decoding mechanism) provides a simple way to evaluate the learned encoded representation, in terms of abridge of information.
Therefore, we set the GP to discover three elements: and encoding mechanism that compacts the input representation; the compact representation itself, i.e. the set of features that conform it; and a decoding mechanism, that provides a supporting role in order to evaluate the learned representation (and the mechanism that generates it). Together, encoder and decoder, form what is known as an autoencoder.
4.1 Individual Representation
The individuals design consist in two forests of GP trees connected through a bus of data. One forest is the encoding mechanism, the other forest is the decoding mechanism and the data bus is the new representation for the data. Fig. 2 illustrates this concept.
Given a dataset comprised of an arbitrary number of samples, each one described by the same features, we wish to reduce the features to new features, losing the least possible amount of information. That is, from the learned features, it should be possible to reconstruct samples to the original features.
The encoder will be comprised of a forest of GP trees ; the decoder will be comprised of a forest of GP trees . Each tree will generate feature of the compact representation , and each tree will generate feature of reconstructed representation .
Terminals of each tree can only be features of representation (as well as constant values within some range), i.e. none of these trees can see any of the original features. Similarly, terminals of tree can only be taken from the original representation, and they cannot look ahead for features from representation they are constructing.
As stated before, our GP individual comprehends both encoder and decoder forests integrated as a single indivisible unit we call from now on autoencoder. These autoencoders never get their encoder and decoder components separated during the evolutionary process, but they exchange small bits of their structure with other autoencoders in a GP population through the evolutionary operators described in the following subsection.
4.2 Evolutionary parameters and operators
We will construct a set of randomly generated GP autoencoders individuals, that will constitute an initial population, and through a GP evolutionary process, we will search for an autoencoder that maximizes the average similarity between each sample and its reconstruction, across an entire training dataset. Table 1 shows the set of GP evolutionary parameters used across all experiments performed.
In each generation, only half of the population is chosen through binary tournament to make it to the next generation. From this half of the population, new individuals are generated with an probability of crossover and a probability of mutation. Thus, given that population size is fixed to 60 individuals, 30 of them are directly taken from the previous generation, 18 are generated from crossover and 9 are generated through mutation. The remaining 3 individuals are chosen directly from previous generation through elitism.
|Max. Tree depth||4|
|Set of Primitives|
|No. Generations||See Sec. 4.4|
Given that individuals are comprised of two set of trees, special crossover and mutation operations had to be defined that differ from the commonly found in GP literature that deals with single tree GP-individuals.
In the case of crossover, when two individuals are selected to undergo crossover, randomly selected trees in the encoder forests are exchanged between the two individuals, then the same process is executed again but now for trees in the decoder forests. When an individual is selected for mutation, one randomly selected tree in each its decoder and encoder forests are deleted, these trees are replaced by new randomly generated ones. Notice how none of these operators operate at node level.
4.3 Objective Function
To determine similarity between an original sample and the reconstructed output from the autoencoder, we used the mean square error (MSE), defined in Eq. 1. MSE receives as input original sample and reconstructed vectors, and compares them feature by feature, averaging the difference across all of features. MSE output can be thought as a distance between a sample and its reconstruction.
4.4 Experimental setup
We implemented and tested three variants of GP algorithms. The first setup consist in a straightfoward GP as described in Sec. 2.1. In this setup, terminals of each tree can be any feature from the entire set of original features. Analogously, terminals of each tree can be any of the features from the compact representation . This is our control setup, and its result will serve as a baseline for comparison purposes with the proposed approach.
For the second setup we implemented a 1-layer DGP (1-DGP) along with its partitioning scheme in the following way: we split the initial input features into subsets . Associated to each subset there is a subset of features from , such that each feature is generated by . Tree can only see the small subset of feautures in Four features from input representation are assigned to each subset .
Mirroring this configuration, the decoding forest is also partitioned in subsets of four trees, such that trees in each subset can only see the subset of three features of some subset . All these subsets, (4) input features-(3) encoding trees-(3) new features-(4) decoding trees, are coupled together, to form a miniautoencoder. Fig. 3 contrast this setup against the straightforward GP.
In our third experimental setup we use a minibatch based form of training. Instead of presenting the entire dataset to each individual every generation, only a very small, variable, subset of samples (minibatch) of the dataset are presented to each individual in every generation. The number of generations for the GP is chosen such that, no. of generations size of minibatches size of complete training dataset. This way we guarantee that the population sees each sample in the training dataset at least once. The minibatches conform a partition, in the mathematical sense, of the entire dataset. The purpose of this setup is to dramatically reduce the computational cost of the GP algorithm, in a similar vein to DNN’s usage of stochastic gradient descend.
The objective function in the first two setups described is to minimize the average MSE across all pairs sample-reconstruction from some given dataset. The evolutionary process is executed for 40 generations in both setups. In every generation, each individual of the population is tested against the entire training dataset, the resulting MSEs for every instance in the dataset are averaged, and this result is assigned as the fitness for a given individual. On the other hand, the objective function in the third setup is minimizing avergae MSE for all samples in the current minibatch presented to the population.
We carried all experiments on grayscale image datasets and each pixel is an input feature. We use every four neightboring pixels in the same row to be in subset .
We compared the three approaches in ML dataset MNIST . For the third setup we also performed a test varying the size of the minibatches and increasing the number of generations in order to allow the GP to see each sample more than just once. After we calibrated the size of minibatches and confirmed that giving more than one pass over training data was beneficial to the proposed method, we further tested it onto two additional ML datasets, namely LFWcrop  and Olivetti . And finally compared the results with those obtained with conventional ANN autoencoders.
Each dataset was split in training and testing set. The evolutionary process is carried with the training set and the top performer individual that results from the process is tested with the testing set, composed of images never before seen during the evolutionary process. Table 2 describes the used datasets.
|Images Resolution (Input features)||28x28 (784)||64x64 (4,096)||64x64 (4,096)|
|No. Training samples||60,000||12,000||360|
|No. Testing samples||10,000||1,233||40|
Fig. 4 shows the results obtained by the straightforward GP, the 1-Layer DGP and the minibatch training version of it; minibatches were composed of 100 samples. All experiments were done in a workstation with an Intel Xeon CPU with 10 physical cores at 2.9 GHz, with two virtual cores per each physical core, to amount for a total of 20 processing threads, 16 GB of RAM, running Ubuntu Linux 16.04. Algorithms implementation and setups were done by using a in-house software library developed in Python version 3.6. Accelerated NumPy library is used only in the final step of fitness evaluation (averaging the MSE of all sample-reconstruction pairs) of each individual. The straightforward GP can make use of the multiple processing cores by parallelizing the evaluation of sample-reconstructions pairs. On the other hand, both 1-Layer DGP approaches distribute evolution of multiple miniautoencoders across most (but not all) processing threads available, and in this case the evaluations of the sample-reconstructions pairs is done sequentially for each miniautoencoder.
A visual depiction of the performance of the synthesized autoencoders is shown in Figures 5 and 6. Fig. 5 shows the gradual increase in performance obtained by the 1-Layer DGP through the evolutionary process. Fig. 6 compares the reconstruction for the first ten images in the training set, as obtained by the best autoencoders generated by the three different experimental setups.
Table 3 shows results of varying the size of minibatches to 30, 60, 100, 300, and 600 samples; as well as allowing the algorithm to give one, two and five forward passes over the training data.
|Mini Batch Size|
We picked up the best performing setup from our minibatch study (minibatches of size 60) and compared its performance against a one hidden layer, fully connected, Multilayer Perceptron (1-MLP) set up as an autoencoder that performs the same compression ratio. We implemented the 1-MLP in TensorFlow Deep Learning library. We also set the size of the minibatches of the 1-MLP to 60, just as in our GP approach. Table 4
shows the results of each approach, for the different amounts of forward passes/epochs. The main idea behind these experiments is to make a 1-to-1 ”layer” and ”epoch-to-epoch” comparison in order the study the behavior of the new proposed method and contrast it with conventional methods of deep learning. Fig.7 show a visual appreciation on the reconstructions generated by both autoencoders for LFWcrop and Olivetti samples.
From the results presented in the previous section we can appreciate that the proposed partitioning approach for DGP is one order of magnitude better than a straightforward GP approach. In fact, the straightforward approach does not reach an acceptable solution at all given approximately the same amount of time, making further clear the advange of DGP layer construction process. Even though the difference, in terms of quality of solutions, is not as decisive between the 1-DGP and the its minibatch learning version, the difference in execution time between them is also one order of magnitude. Regarding this last approach, results show that size of minibatches have to be carefully selected in order to get a balance between quality of solution and execution time. We also confirmed that the 1-DGP minibatch technique can benefit from making several passes over the training data.
When compared with a conventional autoencoder, results show that GP beheaves quite different from ANN. GP can quickly (in terms of passes over the training data) build acceptable encoding-decoding models, while an ANN that attempts to generate a representation of 3,072 (588) features from 4,096 (785) initial features, in the case of LFWcrop and Olivetti (MNIST) datasets, has just too many parameters to adjust. This effect is further noticeable as we test datasets with fewer training samples. The GP is simply a more efficient approach in terms of training data usage.
Nevertheless, our GP approach is still behind ANN in terms of total execution time. However this has more to do with the current state of the software implementations of each. While the GP was implemented in pure Python an is prototype grade, TensorFlow is a highly optimized, mature, enterprise grade library.
5 Concluding remarks and future work
In this work we have introduced a new method that allows GP to perform representation learning on large-scale problems without the need of specialized primitives. We provided experimental results that prove the performance gains when compared against a straightforward GP approach.
There is no reason to believe that this method, applied iteratively, can yield representations compact and abstract enough to be usable for other machine learning tasks such as classification or decision taking, and therefore attain results previously though unreachable for GP.
Some of the inmediate questions that arise are:
Does subsets have to form a partition? Probably not. In fact, it could be better if some features belong to more than just one subset .
What other strategies could be used to build subsets ? So far we have presented a simple strategy based on picking (spatially) neighboring features to be in the same subset but other more complex strategies could exist.
How the evolutionary parameters tax on the performance of the algorithm, including the set of primitives available? A complete study is required varying the probabilities of genetic operators, population size, number of available primitives, etc.
Why the minibatched version is as good, if not better, in terms of quality than the regular version that sees all training samples each generation? This form of semi-online training probably provides robustness to the population overall, but a more comprehensive answer to this questions is desired.
There is still quite a road ahead before we can answer whether or not this method can be competitive with modern state-of-art deep learning methods. The most important question being if adding more layers to a DGP will still keep the upperhand on its side when compared with multilayered ANN. Nevertheless we beleive that many of the necessary ingredients are already there for a GP-based deep learning framework to emerge.
-  Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
-  Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
-  Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
-  Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. science, 313(5786):504–507, 2006.
John H Holland.
Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT press, 1992.
-  John R Koza. Genetic programming: on the programming of computers by means of natural selection, volume 1. MIT press, 1992.
The mnist database of handwritten digits.http://yann. lecun. com/exdb/mnist/, 1998.
-  Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
-  Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788, 1999.
-  Mauricio García Limón, Hugo Jair Escalante, Eduardo Morales, and Luis Villaseñor Pineda. Class-specific feature generation for 1nn through genetic programming. In Power, Electronics and Computing (ROPEC), 2015 IEEE International Autumn Meeting on, pages 1–6. IEEE, 2015.
Jung-Yi Lin, Hao-Ren Ke, Been-Chian Chien, and Wei-Pang Yang.
Designing a classifier by a layered multi-population genetic programming approach.Pattern Recognition, 40(8):2211–2225, 2007.
-  Sebastian Mika, Gunnar Ratsch, Jason Weston, Bernhard Scholkopf, and Klaus-Robert Mullers. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX, 1999. Proceedings of the 1999 IEEE Signal Processing Society Workshop., pages 41–48. IEEE, 1999.
Ferdinando S Samaria and Andy C Harter.
Parameterisation of a stochastic model for human face identification.
Applications of Computer Vision, 1994., Proceedings of the Second IEEE Workshop on, pages 138–142. IEEE, 1994.
-  C Sanderson. Lfwcrop face dataset, 2014.
-  Ling Shao, Li Liu, and Xuelong Li. Feature learning for image classification via multiobjective genetic programming. IEEE Transactions on Neural Networks and Learning Systems, 25(7):1359–1371, 2014.
Rainer Storn and Kenneth Price.
Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces.Journal of global optimization, 11(4):341–359, 1997.
Leonardo Trujillo and Gustavo Olague.
Synthesis of interest point detectors through genetic programming.
Proceedings of the 8th annual conference on Genetic and evolutionary computation, pages 887–894. ACM, 2006.
-  Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis. Chemometrics and intelligent laboratory systems, 2(1-3):37–52, 1987.