Automated Problem Identification: Regression vs Classification via Evolutionary Deep Networks

07/03/2017
by   Emmanuel Dufourq, et al.

Regression or classification? This is perhaps the most basic question faced when tackling a new supervised learning problem. We present an Evolutionary Deep Learning (EDL) algorithm that automatically solves this by identifying the question type with high accuracy, along with a proposed deep architecture. Typically, a significant amount of human insight and preparation is required prior to executing machine learning algorithms. For example, when creating deep neural networks, the number of parameters must be selected in advance and furthermore, a lot of these choices are made based upon pre-existing knowledge of the data such as the use of a categorical cross entropy loss function. Humans are able to study a dataset and decide whether it represents a classification or a regression problem, and consequently make decisions which will be applied to the execution of the neural network. We propose the Automated Problem Identification (API) algorithm, which uses an evolutionary algorithm interface to TensorFlow to manipulate a deep neural network to decide if a dataset represents a classification or a regression problem. We test API on 16 different classification, regression and sentiment analysis datasets with up to 10,000 features and up to 17,000 unique target values. API achieves an average accuracy of 96.3% in identifying the problem type without hardcoding any insights about the general characteristics of regression or classification problems. For example, API successfully identifies classification problems even with 1000 target values. Furthermore, the algorithm recommends which loss function to use and also recommends a neural network architecture. Our work is therefore a step towards fully automated machine learning.


1. Introduction

As the performance of machine learning algorithms has skyrocketed in recent years, the often unspoken relationship between the human data scientist and the machines they run has evolved significantly. A great deal of work has gone into new state-of-the-art methods, and researchers are constantly optimising the various aspects of machine learning algorithms. Such efforts include proposing algorithms for optimising hyperparameters and network architectures [1], and the latest trends show increasing emphasis on algorithms that require less human intervention. Consider the automatic statistician project (https://www.automaticstatistician.com/index/), which aims to remove the data scientist from the process of understanding data by using Bayesian model selection. Real et al. [1] propose an evolutionary algorithm for optimising image classification neural networks which requires no human intervention in creating the networks. Similarly, Zoph and Le [2] use recurrent neural networks along with reinforcement learning to achieve a similar goal. It is clear from these research efforts that this trend will continue, driven both by the potential industrial profits from compensating for the shortage of expensive data scientists and by the general goal of Artificial General Intelligence (AGI).

Nevertheless, for most current machine learning algorithms, a considerable amount of human intervention must be performed prior to the final execution of the algorithm: setting the number of parameters, preprocessing the data, deciding on the loss function and interpreting the results, to name a few. Another example, and perhaps the first step in the data science process, is problem identification: "does a supervised set of data correspond to a classification or a regression problem?" Determining which of the two problem types a given dataset represents is a step in the direction of automated machine learning and is the subject of this study.

Figure 1.

Each chromosome contains four genes of which one gene represents a network architecture. The figure illustrates an example of a network architecture generated by an API chromosome (which was obtained at the end of an execution of the API algorithm). The input dataset was CIFAR-10 – an image classification dataset. The chromosome recommended that the last layer should have 10 units and that these should use the sigmoid activation function. Furthermore, the chromosome recommended using the categorical cross entropy loss function, and consequently, correctly determined that the dataset was a classification problem.

Classification problems typically represent a set of problems whereby the goal is to create a predictive model that can discriminate between various known classes. CIFAR-10 and MNIST are examples of classification datasets where the goal is to identify the correct label for each image (airplane, automobile, bird, cat, etc. and digits, respectively). For regression problems, the predictive output is continuous (as opposed to discrete in the case of classification). An example of a regression dataset is the Boston housing price regression dataset, for which the goal is to predict the median value of the houses.

In the context of deep learning [3], when presented with a dataset, typically one will verify whether the data represents a classification or a regression problem, and then will decide on the loss function and network layers accordingly. For the CIFAR-10 image dataset, one might consider using convolutional, dropout and fully connected layers; and for the Boston housing price dataset one might use fully connected and dropout layers. Furthermore, a decision should be made with regards to which loss function (or equivalently, figure of merit) to use. For CIFAR-10 one might use categorical cross entropy, and use the mean squared error loss function for the Boston housing case. As researchers in machine learning, in most cases, these decisions can be made with relative ease. For a machine, on the other hand, this decision is non-trivial and current machine learning algorithms do not automatically decide if a given dataset is a classification or a regression problem; nor do they recommend a loss function.

In this study, a genetic algorithm (GA) harnessed to a dynamic and flexible deep learning framework is proposed for the automated identification of problems. We call this the Automated Problem Identification (API) algorithm and show that it can successfully determine whether a dataset is a classification or a regression one and, furthermore, recommend whether to use categorical cross entropy or mean squared error. Additionally, API recommends which layers (e.g. convolutional or fully connected), from a known set, to use, either as the final architecture or as the input to further optimisation. Figure 1 illustrates an example of a network which was produced by a chromosome when the CIFAR-10 dataset was input into API. The resulting architecture is very similar to one that a human might use for the problem.

This paper is organized as follows: Section 2 describes GAs. Section 3 describes the API chromosome which is used to determine if a dataset is a classification or a regression optimisation problem. Section 4 provides the details for the proposed API algorithm. The experimental setup is presented in Section 5 and Section 6 discusses the results. We conclude in Section 7 and discuss our future work.

2. Genetic Algorithm

A Genetic Algorithm (GA) [4] is a biologically inspired evolutionary algorithm [5]. GAs mimic the way that species fight for survival and reproduce in nature. A GA makes use of a population of chromosomes to solve an optimisation problem. Each chromosome encodes a potential solution to the problem. Over time the chromosomes undergo many modifications, known as genetic operators, in order to traverse the search space. A fitness function is used to determine how good a chromosome is at solving the optimisation problem. Each generation, parent chromosomes are selected and genetic operators are applied to those parents to create offspring, which then constitute the new chromosome (and parent) population. The new population is evaluated for fitness and the process is repeated, as illustrated in Algorithm 1.

input : generation_max: maximum number of GA generations
begin
      Create an initial population of chromosomes.
      Evaluate the initial population.
      generation ← 0
      while generation < generation_max do
            generation ← generation + 1
            Select the parents.
            Perform the genetic operators.
            Replace the current population with the offspring created by the genetic operators.
            Evaluate the current population.
      return The best chromosome.
Algorithm 1 Genetic algorithm
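
As a concrete illustration of the generational loop in Algorithm 1, the following self-contained Python sketch runs a minimal GA on a toy problem (maximising the number of ones in a bitstring). It is not the API implementation: the toy fitness is maximised rather than minimised, and the truncation-style parent selection is a simplification of the tournament selection described in section 4.2.

import random

def fitness(chromosome):
    # Toy fitness: the number of ones (higher is better in this toy example).
    return sum(chromosome)

def ga(chromosome_length=20, population_size=30, generation_max=40):
    # Create and evaluate an initial population of random bitstrings.
    population = [[random.randint(0, 1) for _ in range(chromosome_length)]
                  for _ in range(population_size)]
    for _ in range(generation_max):
        # Parent selection: keep the fitter half (a simplification of tournament selection).
        parents = sorted(population, key=fitness, reverse=True)[:population_size // 2]
        offspring = []
        while len(offspring) < population_size:
            p1, p2 = random.sample(parents, 2)
            point = random.randrange(1, chromosome_length)   # one-point crossover
            child = p1[:point] + p2[point:]
            if random.random() < 0.3:                        # mutation: flip one random bit
                i = random.randrange(chromosome_length)
                child[i] = 1 - child[i]
            offspring.append(child)
        population = offspring                               # replace the population
    return max(population, key=fitness)

best = ga()
print(best, fitness(best))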

3. Proposed API Chromosome

In this section and the following subsections, we describe the API chromosome and each of the genes within it. In this study, the word layer refers to the layers in deep neural network architectures. Each chromosome is made up of four genes, namely: the neural network loss function, the number of units in the last layer of the neural network, the activation function used in the last layer, and the configuration of the layers (configurations are explained in section 3.4). A chromosome thus encodes an entire deep neural network architecture and an associated loss function.

Figure 2 illustrates an example of an API chromosome that encodes a neural network architecture with the following layers: a fully connected layer, dropout, and two further fully connected layers. Furthermore, the chromosome will apply the mean squared error loss function during the training of the neural network, and the last layer has one unit whose activation function is a rectified linear unit. The following subsections provide additional details about the four genes.

We chose to use GAs because the number of genes can easily be modified to encode additional complexity, and because GAs naturally handle the discrete nature of the parameters being chosen; API searches through a space of network architectures in addition to other parameters. We can increase the complexity of the chromosomes by including more parameters, as we discuss in section 7.

Figure 2.

Example of an API chromosome which encodes the mean squared error loss function and one unit in the last layer of the network, which uses the relu activation function. The architecture of the network, denoted [1, 2, 1, 1], represents a fully connected layer, followed by dropout and two fully connected layers. The configurations are explained in section 3.4.

3.1. Loss Function

This gene represents the loss function that will be used when training the network and it takes on two possible values: Mean Squared Error (MSE) and Categorical Cross Entropy (CCE) loss. Let $y_i$ denote the target label for sample $i$, $\hat{y}_i$ denote the model's predicted output for sample $i$, and $N$ denote the number of training samples. The mean squared error used in this study is presented in equation 1.

$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$   (1)

For example, given a target value $y_i$, a network $A$ whose prediction lies closer to $y_i$ obtains a smaller squared error than a network $B$ whose prediction lies further away; network $A$ is preferred since $\mathrm{MSE}_A < \mathrm{MSE}_B$. When using the MSE, the objective of an optimisation algorithm is to minimise the MSE value, thereby reducing the distance between the correct values and the model's predicted values.

The categorical cross entropy used in this study is presented in equation 2.

$\mathrm{CCE} = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{k} y_{ij}\,\log\left(\hat{y}_{ij}\right)$   (2)

When using this loss function the objective is to maximise the CCE so that the network predictions are as similar to the labels as possible. In this case, the target labels are represented as vectors (often one-hot encoded) and the network predictions are vectors of the same length. For example, if the target for a sample is one-hot encoded, a network $A$ that assigns a higher predicted probability to the correct class than a network $B$ obtains $\mathrm{CCE}_A > \mathrm{CCE}_B$, and network $A$ is preferred.
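
To make the two loss functions concrete, the following minimal numpy sketch evaluates both on a small hypothetical batch; the numbers are purely illustrative.

import numpy as np

# Hypothetical regression targets and predictions (illustrative values only).
y_true = np.array([2.0, 0.5, 1.5])
y_pred = np.array([1.8, 0.7, 1.4])
mse = np.mean((y_true - y_pred) ** 2)                     # equation (1)

# Hypothetical one-hot targets and predicted class probabilities.
t_onehot = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
p_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
cce = np.mean(np.sum(t_onehot * np.log(p_pred), axis=1))  # equation (2)

print(f"MSE = {mse:.4f}")   # smaller is better
print(f"CCE = {cce:.4f}")   # larger (closer to zero) is better under this formulation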

3.2. Number of Units in Last Layer

The second gene denotes the number of units in the last layer of the network. There are two possible values for this gene: one or $k$, where $k$ denotes the number of unique values in $y$ ($y$ represents the target values for a dataset). Formally, $k = \left|\{y_1, \ldots, y_N\}\right|$, where $y_i \in y$ and $N$ is the number of samples. For example, if the targets of some dataset contain four distinct values, then $k = 4$.

3.3. Last Layer Function

This gene takes on four possible values and denotes which activation function will be used in the last layer of the network. The possible values are: {linear, relu, sigmoid, softmax}. Here 'relu' refers to rectified linear units. Given some input $x$ to a layer, the equations for each of the activation functions are presented in equations 3 to 6 respectively.

$f(x) = x$   (3)
$f(x) = \max(0, x)$   (4)
$f(x) = \frac{1}{1 + e^{-x}}$   (5)
$f(x)_j = \frac{e^{x_j}}{\sum_{d=1}^{D} e^{x_d}}, \quad j = 1, \ldots, D$   (6)

where $D$ denotes the dimension of the input $x$ to the softmax.
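
A minimal numpy sketch of the four candidate activation functions follows; the max-subtraction inside the softmax is a standard numerical-stability detail rather than part of the gene encoding.

import numpy as np

def linear(x):
    return x

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the maximum for numerical stability
    return e / e.sum()

x = np.array([-1.0, 0.0, 2.0])
for f in (linear, relu, sigmoid, softmax):
    print(f.__name__, f(x))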

3.4. Configuration of Layers

Each chromosome has a gene which corresponds to the architecture of the network, which we define as the configuration. The configuration represents the exact sequence of the network layers and is stored in a list. The first element in the configuration represents the first layer and the last element represents the last layer. There are four possible values which each element in the configuration can take, namely: convolution [6], fully connected, dropout [7] and max pooling [8]. Here, convolution refers to two-dimensional convolution. We add dropout to the list of possible configuration values even though dropout is not a layer.

The size of the configurations is randomly selected between 5 and 15. The configurations are initialised randomly during the initial population generation and modified during the mutation operator; these are explained in sections 4.1 and 4.3.1 respectively. Each of the layers is mapped to an integer value, i.e. convolution is mapped to 0, fully connected to 1, dropout to 2 and max pooling to 3. Each chromosome has exactly one configuration.

We provide the following example to illustrate the configurations. Let the configuration for a chromosome be: [2, 0, 3, 3, 0, 0, 1, 2, 1, 1]; figure 1 illustrates this network. The network is comprised of several convolution and max pooling layers followed by fully connected and dropout layers.
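
As an illustration of how such a configuration could be decoded into an actual network, the following sketch builds a tf.keras model from the integer list using the layer hyperparameters listed in table 3 (10 filters, 2x2 kernels, 100 units, dropout keep probability 0.8). The 'same' padding, the Flatten layer inserted before the first fully connected layer, and the silent skipping of convolution and pooling genes on one-dimensional inputs are simplifying assumptions made to keep the sketch runnable; API itself penalises invalid combinations instead (section 3.5).

import tensorflow as tf

CONV, DENSE, DROPOUT, MAXPOOL = 0, 1, 2, 3

def build_network(configuration, input_shape, last_units, last_activation):
    # Decode an integer configuration list into a tf.keras model using the
    # fixed layer hyperparameters of table 3.
    model = tf.keras.Sequential([tf.keras.Input(shape=input_shape)])
    flattened = len(input_shape) == 1
    for gene in configuration:
        if gene == CONV and not flattened:
            model.add(tf.keras.layers.Conv2D(10, (2, 2), strides=1,
                                             padding="same", activation="relu"))
        elif gene == MAXPOOL and not flattened:
            model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=1,
                                                   padding="same"))
        elif gene == DROPOUT:
            model.add(tf.keras.layers.Dropout(0.2))          # keep probability 0.8
        elif gene == DENSE:
            if not flattened:
                model.add(tf.keras.layers.Flatten())          # assumed transition
                flattened = True
            model.add(tf.keras.layers.Dense(100, activation="relu"))
    if not flattened:
        model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(last_units, activation=last_activation))
    return model

# The configuration from figure 1, decoded for CIFAR-10-shaped input.
net = build_network([2, 0, 3, 3, 0, 0, 1, 2, 1, 1],
                    input_shape=(32, 32, 3), last_units=10,
                    last_activation="sigmoid")
net.summary()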

3.5. Chromosome Fitness Evaluation

GAs make use of a fitness function to evaluate how good a chromosome is at solving an optimisation problem. In our case, we designed a fitness function to discriminate between classification and regression problems. When the proposed system commences, it splits the dataset into two subsets: the features $X$ and the labels $y$. The labels are then converted into their corresponding one-hot encoded values. For example, if a label has a value of 2 and the unique values are {0, 1, 2}, then the one-hot encoded value of '2' is [0 0 1]. The system retains both $y$ and the one-hot encoded values. The dataset is split such that 50% of the data is in the training set and the remainder in the validation set.

Each chromosome is evaluated as follows. The chromosome's loss function is used to train the neural network on the training data. If the chromosome's loss function is categorical cross entropy, then the one-hot encoded values are used during training. However, if the loss function is mean squared error, then the $y$ values are used during training.

The validation loss is recorded during the optimisation of the neural network across the epochs. Let the validation loss be $\mathcal{L} = \{\ell_1, \ldots, \ell_E\}$, where $E$ denotes the total number of epochs and $\ell_i$ denotes the validation loss for epoch $i$. We summarise the change in validation loss over training by the ratio of validation drop, $R = \ell_E / \ell_1$. Thus, for each chromosome, after the optimisation of the neural network has taken place, we compute $R$. If $R > 1$, the network has not done any learning, since the validation loss increased. Furthermore, if $R = 1$, then once again the network has not managed to learn anything, since the validation loss has remained constant over the epochs. Finally, if $R < 1$, we conclude, given the drop in validation loss, that the network has managed to learn.

The model then predicts the output values on the validation data. The predictions and the validation target values are compared using mean squared error. The loss obtained on the validation data using categorical cross entropy would differ from the loss computed using mean squared error; we therefore use the mean squared error for all chromosomes so that the comparisons are consistent. In the case where the network has not learnt anything, we penalise the chromosome with a fitness of infinity. However, in the case where the network has learnt, i.e. $R < 1$, we assign the computed validation mean squared error as the fitness of the chromosome. The objective of the API algorithm is thus to minimise the fitness of each chromosome by rewarding networks that learn and have a small mean squared error on the validation set. The lower the fitness value, the better the chromosome performed.
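
A minimal sketch of this fitness rule, assuming the ratio $R = \ell_E/\ell_1$ defined above and purely hypothetical validation values:

import numpy as np

def chromosome_fitness(val_losses, val_predictions, val_targets):
    # val_losses: validation loss per epoch; the fitness is infinity when the
    # network has not learnt (R >= 1), otherwise the validation mean squared error.
    val_losses = np.asarray(val_losses, dtype=float)
    R = val_losses[-1] / val_losses[0]            # ratio of validation drop
    if R >= 1.0:                                  # loss increased or stayed constant
        return np.inf
    predictions = np.asarray(val_predictions, dtype=float)
    targets = np.asarray(val_targets, dtype=float)
    return float(np.mean((targets - predictions) ** 2))

# Hypothetical example: the validation loss drops over five epochs, so the
# chromosome receives the validation MSE as its fitness.
print(chromosome_fitness([1.2, 0.9, 0.7, 0.6, 0.55],
                         val_predictions=[0.9, 0.1, 0.4],
                         val_targets=[1.0, 0.0, 0.5]))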

Figure 3 illustrates a plot which explains the fitness of a chromosome. The plot is separated into two regions at $R = 1$. From the plot, it is observed that when $R \geq 1$ the fitness is set to a very large value. When $R < 1$ the value of the fitness corresponds to the mean squared error, whereby a smaller value is better. For the sake of the example, a straight line was drawn for $R < 1$ to illustrate that a smaller mean squared error results in a better chromosome fitness.

Figure 3. During the training of each neural network we record the loss on the validation data. We then determine whether or not the network has learnt. If the network has not learnt ($R \geq 1$) then we penalise the chromosome with a very large fitness. However, if the network was able to learn ($R < 1$) then we assign the mean squared error as the fitness value.

Since the chromosomes are randomly generated, it is possible that they represent invalid networks for the particular features and labels of a given dataset. For example, assume that, for some chromosome, the number of units in the last layer is 1 and the categorical cross entropy loss function is used. Given the description earlier in this subsection, the one-hot encoded values should then be used during training; however, the number of outputs is 1, so a one-hot encoded vector cannot be compared to a single value. As another example, consider a chromosome that tries to use convolutional layers on a feature-based regression dataset; this is, of course, not feasible. Invalid chromosomes such as these are penalised with a fitness of infinity. Section 4.4 describes how the chromosome makes the discrimination between regression and classification.

4. The API Algorithm

The following subsections explain how each aspect of the GA has been adapted to determine if a given dataset is a classification or regression problem. Upon termination, the algorithm also recommends the loss function which should be used to train a neural network, the number of units and the type of activation function in the last layer, and a simple network architecture. The API algorithm performs optimisation at two levels: the GA optimises the population of chromosomes, and, since each chromosome represents a neural network, optimisation is also performed when training each network.

4.1. Initial Population Generation

The initial population size is set to the same value as the user-defined population size. Suppose the population size is n, then n chromosomes are created during the initial population generation. Each chromosome has a fixed length of 4 genes (discussed in section 3). During the creation of a chromosome, each gene is randomly created based on the values each gene can take on. The pseudocode for creating a chromosome is presented in algorithm 2. The initial fitness of each chromosome is set to infinity.

begin
      Initialise an empty chromosome.
      Set the loss function to either categorical cross entropy or mean squared error.
      Set the number of units in the last layer to either one or k.
      Set the activation function in the last layer to either linear, sigmoid, softmax or relu.
      Create a random configuration.
Algorithm 2 Creating a chromosome.
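
A minimal Python sketch of algorithm 2, assuming a simple dictionary representation of the four genes (the exact data structure is an implementation detail):

import random

def create_chromosome(k):
    # A sketch of algorithm 2 using a plain dict for the four genes; k is the
    # number of unique target values in the dataset (section 3.2).
    return {
        "loss": random.choice(["categorical_crossentropy", "mean_squared_error"]),
        "last_units": random.choice([1, k]),
        "last_activation": random.choice(["linear", "relu", "sigmoid", "softmax"]),
        # Configuration: 0 = convolution, 1 = fully connected, 2 = dropout,
        # 3 = max pooling; length randomly chosen between 5 and 15 (section 3.4).
        "configuration": [random.randint(0, 3) for _ in range(random.randint(5, 15))],
        "fitness": float("inf"),   # initial fitness is set to infinity
    }

population = [create_chromosome(k=10) for _ in range(50)]   # population size as in table 2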

4.2. Parent Selection

Parent selection methods are used to obtain parents from the current population of chromosomes. These parents are used by the genetic operators to create offspring. A single parent is obtained when the parent selection method is executed. Once a chromosome has been chosen to be a parent, the selection method can select that particular chromosome again. Three common parent selection methods are fitness-proportionate, rank and tournament selection [9]. For this study, tournament selection was used given that it was shown to be a successful method by Zhong et al. [10].

Algorithm 3 presents the pseudocode for the tournament selection. This selection method has one user-defined parameter, namely, the tournament size. Let k be the tournament size. Tournament selection randomly selects k chromosomes from the current GA population, and compares the fitness of each of the k chromosomes. The chromosome with the lowest fitness is returned as the parent chromosome. If a tie occurs, then a random chromosome is selected to break the tie.

input : size: size of the tournament
output : The best chromosome which will be used as a parent
begin
      current_best ← null
      for i ← 1 to size do
            random_chromosome ← randomly select a chromosome from the population
            Evaluate random_chromosome
            if current_best is null or fitness of random_chromosome < fitness of current_best then
                  current_best ← random_chromosome
      return current_best
Algorithm 3 Pseudocode for tournament selection.
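
A corresponding Python sketch of tournament selection over such chromosome dictionaries; sampling with replacement within a tournament is an assumption:

import random

def tournament_selection(population, size=5):
    # Sample `size` chromosomes (here with replacement) and return the one with
    # the lowest fitness; ties are effectively broken by the random sampling order.
    current_best = None
    for _ in range(size):
        candidate = random.choice(population)
        if current_best is None or candidate["fitness"] < current_best["fitness"]:
            current_best = candidate
    return current_best

# Example with a toy population of already-evaluated chromosomes.
toy_population = [{"fitness": f} for f in [0.8, 0.3, float("inf"), 0.5, 1.2]]
parent = tournament_selection(toy_population, size=3)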

4.3. Genetic Operators

Genetic operators are applied to parents to exchange genetic material between the parent chromosomes, and to consequently create novel offspring. The two most common genetic operators are mutation and crossover. Their implementation details for this study are described below.

4.3.1. Mutation

The mutation genetic operator makes use of a single parent chromosome. During the execution of mutation, a gene is randomly selected and a new value for that gene is created. A user-defined parameter is associated with the mutation operator, namely the mutation application rate. Figure 4 illustrates the application of the mutation operator on a parent chromosome and the resulting offspring. In the example, the fourth gene was selected for mutation and thus the fourth gene within the parent was changed from a configuration of [1, 2, 1, 1] to [1, 1, 1, 1, 1] in the offspring.

Figure 4. Example of the mutation operator being applied to a parent chromosome. The fourth gene was selected for mutation and consequently a new configuration was generated for the offspring.
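
A sketch of the mutation operator on the same dictionary representation; regenerating the entire configuration when the fourth gene is selected mirrors the figure 4 example, although finer-grained configuration mutations are equally plausible:

import random

def mutate(parent, k):
    # Copy the parent, pick one of the four genes at random and regenerate it.
    offspring = dict(parent)
    gene = random.choice(["loss", "last_units", "last_activation", "configuration"])
    if gene == "loss":
        offspring["loss"] = random.choice(["categorical_crossentropy",
                                           "mean_squared_error"])
    elif gene == "last_units":
        offspring["last_units"] = random.choice([1, k])
    elif gene == "last_activation":
        offspring["last_activation"] = random.choice(["linear", "relu",
                                                      "sigmoid", "softmax"])
    else:
        # Regenerate the whole configuration (0: conv, 1: fully connected,
        # 2: dropout, 3: max pooling), as in the figure 4 example.
        offspring["configuration"] = [random.randint(0, 3)
                                      for _ in range(random.randint(5, 15))]
    offspring["fitness"] = float("inf")   # the offspring must be re-evaluated
    return offspring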

4.3.2. Crossover

The crossover genetic operator exchanges genetic material between two parent chromosomes and consequently creates two offspring. There are several variations of the crossover genetic operator, such as uniform, one-point and two-point crossover.

The crossover method we implement randomly selects a position p in the range [1, n], where n denotes the length of the chromosome; the same position p must be selected within the two parents. Two offspring are created, and all the genes except those at position p are copied across to the corresponding offspring without modification. The genes at position p are swapped, i.e., the gene at position p from the first parent is inserted at position p in the second offspring, and similarly, the gene at position p from the second parent is inserted at position p in the first offspring.

An example of the application of the crossover operator is presented in figure 5. The figure shows two parent chromosomes. The crossover point was the third gene from each parent, i.e. the last activation function was swapped between the parents. The offspring show the result of the crossover.

Figure 5. Example of the crossover operator being applied to two parent chromosomes. The third gene from both of the parents was selected for crossover. As a result, the last-layer activation functions were swapped between the parents.
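
A sketch of this single-gene crossover on the same dictionary representation:

import random

GENES = ["loss", "last_units", "last_activation", "configuration"]

def crossover(parent_1, parent_2):
    # Copy both parents and swap the gene at one randomly chosen position p;
    # the same position is used in both parents.
    offspring_1, offspring_2 = dict(parent_1), dict(parent_2)
    p = random.choice(GENES)
    offspring_1[p], offspring_2[p] = parent_2[p], parent_1[p]
    offspring_1["fitness"] = offspring_2["fitness"] = float("inf")
    return offspring_1, offspring_2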

4.4. Algorithm Termination and Final Decision

At the end of the generational loop, the best chromosome is output. The loss function in the best chromosome is then used to decide if the dataset was a classification or a regression problem. If the loss function was categorical cross entropy, then the problem was labelled as classification. However, if the loss function was mean squared error then the problem was labelled as regression.

5. Experimental Setup

This section describes the experimental setup which was used to evaluate the performance of API. The algorithm was programmed in Python 3.6.0 and TensorFlow 1.1.0 [11]. The algorithm was evaluated on a machine with an Intel Core i7-6700K CPU and 16GB RAM.

5.1. Datasets

Table 1 presents the 16 datasets which were used in this study along with their characteristics and type. All of the datasets were obtained from the UCI machine learning repository [12], except for CIFAR-10 and CIFAR-100, which were obtained from [13], MNIST from [14], and CrowdFlower (https://www.crowdflower.com/data-for-everyone/), Aloi (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html) and IMDB (https://keras.io/datasets/), which were obtained externally. In this study, CrowdFlower represents the 'emotion in text' dataset. The assumptions are that there are no missing values in each dataset and that categorical features are converted to corresponding numerical features using a one-hot encoding approach. Of course, it would be possible to implement an imputation method [15] to handle datasets with missing values; however, this was not part of the scope of this study. The algorithm standardises each feature. Where possible, we used 1000 samples for training and 1000 for validation. Boston housing, for example, did not have that many samples; in this case, we simply split the dataset equally into two sets. We distinguish between data and image classification problems because in the former the data are typically represented by one-dimensional vectors, whereas image classification datasets are commonly represented as three-dimensional arrays. API can adapt to the various input shapes without human intervention.
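
A minimal numpy sketch of this preparation step, assuming in-memory arrays; the random shuffle before the 50/50 split is an assumption rather than a stated detail:

import numpy as np

def prepare_dataset(X, y, seed=0):
    # Standardise each feature, one-hot encode the labels from their unique
    # values and split the data 50/50 into training and validation sets.
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

    unique_targets = np.unique(y)                 # the k unique target values
    one_hot = (y[:, None] == unique_targets[None, :]).astype(float)

    rng = np.random.default_rng(seed)             # the shuffle is an assumption
    idx = rng.permutation(len(X))
    half = len(X) // 2
    train, valid = idx[:half], idx[half:]
    return (X[train], y[train], one_hot[train]), (X[valid], y[valid], one_hot[valid])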

Dataset Features Unique Targets Type
Aloi 128 1000 D
Isolet5 617 26 D
Letter Recognition 16 26 D
Sensorless Drive 48 11 D
Year Prediction 90 89 D
Boston Housing 13 506 R
CCPP 4 4837 R
Concrete Comp 15 1030 R
Forest Fires 29 17380 R
Physicochemical 9 15903 R
Relative CT Slice 384 2001 R
CIFAR-10 3072 10 IC
CIFAR-100 3072 100 IC
MNIST 784 10 IC
CrowdFlower 1000 13 SA
IMDB 10000 2 SA
Table 1. The 16 datasets used in this study. We used datasets from four problem domains with various characteristics, denoted as follows: ‘D’ represents data classification, ‘R’ regression, ‘IC’ image classification and ‘SA’ sentiment analysis. The sentiment analysis datasets were considered as classification problems. The unique targets column refers to the number of unique values in the targets of each dataset; for example, CIFAR-10 has 10 unique target classes, whereas Physicochemical has 15903 unique target values. For CrowdFlower and IMDB we used a bag-of-words approach in order to generate word embeddings.

5.2. Experimental Parameters

The GA and neural network parameters used in this study are presented in tables 2 and 3 respectively. These parameters were obtained by preliminary runs of the algorithm. The purpose of this study was to evolve chromosomes that could determine whether a given dataset was classification or regression in addition to several other outputs. Certain variables had to remain fixed in order to evolve the chromosomes. Each parameter in table 3 was set to a fixed value.

Parameter Value
Crossover rate 70%
Mutation rate 30%
Number of generations 10
Tournament size 5
Population size 50
Table 2. The GA parameters used in this study. Preliminary experiments revealed that we did not need to use a large population size or a large number of generations to evolve accurate solutions.

6. Results and Discussion

The results obtained by API are presented and discussed in this section. The Aloi dataset was included in the experiments because one could hypothesise that if a dataset has a large number of targets then it is a regression dataset. For this reason, we included Aloi as it has a much larger number of classes in comparison to the other classification datasets. The accuracy results achieved by API on the 16 datasets across the 20 runs are presented in figure 6. When discriminating between regression or classification problems, API obtained an average accuracy of 96.3%, the lowest accuracy was 90% which was obtained on 3 datasets and the highest accuracy was 100% which was achieved on 7 datasets.

Figure 6. Accuracy (%) results obtained by API on the various datasets. For each dataset, 20 runs of the algorithm were executed. The lowest accuracy was 90% and API achieved 100% accuracy on 7 datasets. The average accuracy across the datasets was 96.3%.

Table 4 presents the number of times, out of 20 runs, that API incorrectly classified each dataset. There were 3 datasets for which API incorrectly classified two runs, which corresponds to an accuracy of 90%. There was no dataset for which the performance across the runs was less than 90%.

In the case of the two misclassifications for the CIFAR-10 dataset, the fitnesses of chromosomes with the mean squared error loss function and those with the categorical cross entropy loss function were very close; for that particular run, the former happened to have a slightly lower fitness. In the second case, the population was rapidly dominated by chromosomes with the mean squared error loss function as the generational loop progressed. A similar observation was made for the other incorrectly classified runs. One possible way of overcoming this issue would be to re-introduce genetic diversity into the population by randomly initialising a number of chromosomes across the generations; this would allow chromosomes containing both types of loss function to remain present in the population. Alternatively, increasing the tournament size could allow weaker chromosomes to remain in the population, which could in turn preserve the balance between chromosomes containing the two loss functions.

Parameter Value
Number of epochs 5
Weight initialisation - mean 0.0
Weight initialisation - standard deviation 0.01
Number of units in all layers except last 100
Activation functions in all layers except last relu
Number of filters in each convolution layer 10
Convolution filter size 2x2
Convolution strides 1
Max pooling size 2x2
Max pooling stride 1
Dropout keep probability 0.8
Learning rate 0.001
Optimiser Adam [16]
Batch size 2048
Table 3. The neural network parameters used in this study. When training the neural network contained in a chromosome, each of the parameters listed in this table was applied.
Dataset Type Number of Incorrectly Classified Runs
Aloi D 2
Isolet5 D 1
Letter Recognition D 0
Sensorless Drive D 1
Year Prediction D 2
Boston Housing R 0
CCPP R 1
Concrete Comp R 0
Forest Fire R 1
Physicochemical R 0
Relative CT Slice R 0
CIFAR-10 IC 2
CIFAR-100 IC 0
MNIST IC 0
CrowdFlower SA 1
IMDB SA 1
Table 4. The table presents the number of runs for which the algorithm incorrectly classified each dataset. The objective of API was to discriminate between regression and classification datasets. For each dataset we performed 20 API runs. A perfect accuracy of 100% was achieved on 7 datasets. For the types, ‘D’ represents data classification, ‘R’ for regression, ‘IC’ for image classification and ‘SA’ for sentiment analysis.

Appendix A presents, for each dataset, an example chromosome that was evolved. These chromosomes were randomly selected from each of the 20 runs. The networks varied in size from 5 to 15 layers; however, in most cases the networks were deep. The architectures generated for the image classification problems are more complex than those generated for the other problems. In particular, the evolved chromosome for the CIFAR-10 dataset was of interest because the configuration resembles an architecture that a human might generate when creating a deep neural network for image classification. For instance, consider AlexNet [13], which is made up of a series of convolutional and max pooling layers towards the start of the network, and ends with three fully connected layers. In a similar way, the chromosome's architecture presented in the appendix has a similar structure of convolution and max pooling layers followed by fully connected and dropout layers.

Some of the other chromosomes in the other runs for CIFAR-10 evolved similar architectures, but this was not always the case. For example, in one particular run, the evolved architecture was: [0, 0, 0, 2, 2, 2, 2, 0, 0, 1]. In this case, the architecture was primarily made up of dropout and convolutional layers – there was only one fully connected layer. For certain runs, the evolved architectures were made up of deep networks containing only fully connected layers. For example, from the appendix, consider the chromosome presented for the Sensorless Drive dataset; the architecture was [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1].

The number of epochs used throughout the optimisation of the neural networks was small. It would thus be of interest to extend this study in order to investigate the architectures which would be generated by using a larger number of epochs. One drawback of API is the computational effort required to obtain the results. It would be of interest to further decrease the population size to determine to which extent it can be reduced whilst retaining its current accuracy in discriminating between classification and regression problems.

7. Conclusion

In our study, we present the Automated Problem Identification (API) algorithm, a genetic algorithm coupled to deep networks that automatically determines whether a dataset represents a regression or a classification problem. While great effort has been put into improving and proposing new machine learning algorithms, typically the practitioner must still decide on the loss function, the neural network architecture, the number of units in each layer and the appropriate activation functions prior to the execution of the neural network. We propose API with the goal of moving towards general artificial intelligence and automated machine learning that requires little to no human intervention.

API was applied 20 times each to 16 different datasets drawn from varied problem domains and with varied data characteristics. We find that API correctly identified the problem type with an average accuracy of 96.3% while running on only a single CPU. Furthermore, API was able to recommend whether to use mean squared error or categorical cross entropy, a suitable number of units in the last layer together with its activation function, and a network architecture. Despite this not being the primary focus of the study, the proposed algorithm generated interesting and relevant deep architectures.

We have already begun working on the next phase of this research, which is to develop an algorithm that can optimise the entire pipeline for creating deep neural networks: the goal is simply to provide the algorithm with a dataset (without specifying whether the problem is classification or regression) and, in return, obtain a deep neural network which produces competitive results. This would completely remove the human from the pipeline. It would be of interest to determine whether the evolved networks could outperform those created by humans. It is clear, given the efforts of various researchers, that the machine learning community should steer towards algorithms which are completely automated, requiring no human intervention.

Appendix A Examples of API chromosomes

Here we illustrate examples of API chromosomes which were evolved on the various datasets. The dataset name is provided along with the problem type. For the loss function, 'MSE' denotes mean squared error and 'CCE' denotes categorical cross entropy. For the configurations, convolution is mapped to 0, fully connected to 1, dropout to 2 and max pooling to 3. In each example the chromosome was able to correctly classify the dataset.

  • Dataset: Aloi – Classification
    Chromosome: Units: 1000, Loss: CCE, Activation: linear, Configuration: [2, 1, 1, 2, 1]

  • Dataset: Isolet5 – Classification
    Chromosome: Units: 1, Loss: MSE, Activation: softmax, Configuration: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

  • Dataset: Letter Recognition – Classification
    Chromosome: Units: 26, Loss: CCE, Activation: sigmoid, Configuration: [1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1]

  • Dataset: Sensorless Drive – Classification
    Chromosome: Units: 11, Loss: CCE, Activation: relu, Configuration: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

  • Dataset: Year Prediction – Classification
    Chromosome: Units: 64, Loss: CCE, Activation: softmax, Configuration: [1, 2, 2, 2, 1, 2, 2, 1, 2, 2, 1]

  • Dataset: Boston Housing – Regression
    Chromosome: Units: 1, Loss: MSE, Activation: softmax, Configuration: [2, 1, 1, 2, 1, 1, 1, 1, 2, 1]

  • Dataset: CCPP – Regression
    Chromosome: Units: 1, Loss: MSE, Activation: softmax, Configuration: [1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1]

  • Dataset: Concrete Comp – Regression
    Chromosome: Units: 1, Loss: MSE, Activation: softmax, Configuration: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

  • Dataset: Forest Fire – Regression
    Chromosome: Units: 1, Loss: MSE, Activation: softmax, Configuration: [1, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 1, 2, 1]

  • Dataset: Physicochemical – Regression
    Chromosome: Units: 1, Loss: MSE, Activation: softmax, Configuration: [1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1]

  • Dataset: Relative CT Slice – Regression
    Chromosome: Units: 1, Loss: MSE, Activation: softmax, Configuration: [1, 2, 1, 1, 2, 1, 1]

  • Dataset: CIFAR-10 – Image classification
    Chromosome: Units: 10, Loss: CCE, Activation: linear, Configuration: [3, 3, 0, 0, 2, 3, 3, 0, 0, 0, 1]

  • Dataset: CIFAR-100 – Image classification
    Chromosome: Units: 100, Loss: CCE, Activation: sigmoid, Configuration: [2, 0, 3, 3, 0, 0, 1, 2, 1, 1, 1]

  • Dataset: MNIST – Image classification
    Chromosome: Units: 10, Loss: CCE, Activation: relu, Configuration: [2, 0, 2, 0, 3, 0, 1]

  • Dataset: CrowdFlower – Sentiment analysis
    Chromosome: Units: 13, Loss: CCE, Activation: sigmoid, Configuration: [1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1]

  • Dataset: IMDB – Sentiment analysis
    Chromosome: Units: 2, Loss: CCE, Activation: softmax, Configuration: [2, 1, 2, 1, 2, 1]

References

  • [1] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Quoc Le, and Alex Kurakin. Large-scale evolution of image classifiers. arXiv preprint arXiv:1703.01041, 2017.
  • [2] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
  • [3] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • [4] David E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edition, 1989.
  • [5] Agoston E. Eiben and J. E. Smith. Introduction to Evolutionary Computing. Springer-Verlag, 2003.
  • [6] Yann LeCun et al. Generalization and network design strategies. Connectionism in perspective, pages 143–155, 1989.
  • [7] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, January 2014.
  • [8] Y. T. Zhou and R. Chellappa. Computation of optical flow using a neural network. In IEEE 1988 International Conference on Neural Networks, pages 71–78 vol.2, July 1988.
  • [9] Tobias Blickle and Lothar Thiele. A comparison of selection schemes used in evolutionary algorithms. Evolutionary Computation, 4(4):361–394, December 1996.
  • [10] Jinghui Zhong, Xiaomin Hu, Jun Zhang, and Min Gu. Comparison of performance between different selection strategies on simple genetic algorithms. In International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC’06), volume 2, pages 1115–1121, Nov 2005.
  • [11] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  • [12] M. Lichman. UCI machine learning repository, 2013. http://archive.ics.uci.edu/ml.
  • [13] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
  • [14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.
  • [15] Patrick E McKnight, Katherine M McKnight, Souraya Sidani, and Aurelio Jose Figueredo. Missing data: A gentle introduction. Guilford Press, 2007.
  • [16] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.