
Improving generalisation of AutoML systems with dynamic fitness evaluations

by   Benjamin Patrick Evans, et al.

A common problem machine learning developers face is overfitting: fitting a pipeline so closely to the training data that performance degrades on unseen data. Automated machine learning aims to free (or at least ease) the developer from the burden of pipeline creation, but this overfitting problem can persist. In fact, it can become more of a problem as we iteratively optimise performance on an internal cross-validation (most often k-fold). While this internal cross-validation hopes to reduce overfitting, we show we can still risk overfitting to the particular folds used. In this work, we aim to remedy this problem by introducing dynamic fitness evaluations which approximate repeated k-fold cross-validation at little extra cost over single k-fold, and at far lower cost than typical repeated k-fold. The results show that, when time is equated, the proposed fitness function yields a significant improvement over the current state-of-the-art baseline method, which uses an internal single k-fold. Furthermore, the proposed extension is very simple to implement on top of existing evolutionary computation methods, and can provide an essentially free boost in generalisation/testing performance.





1 Introduction

With a rising demand for machine learning coming from a variety of application areas, machine learning talent is struggling to keep up. This has spurred the development of Automated Machine Learning (AutoML), which hopes to save time and effort on repetitive tasks in ML [truong2019towards] by allowing data scientists to work on other important components such as "developing meaningful hypothesis" or "communication of results" [le2019scaling]. The usefulness of k-fold cross-validation (CV) has recently been doubted in model evaluation research [zhang2015cross, krstajic2014cross]; however, iterative improvement on an internal single k-fold CV remains at the core of many AutoML optimisation problems. In this work, we introduce an efficient approach to approximating repeated k-fold CV (rk-fold), which has been shown to offer improved error estimation over typical k-fold CV [zhang2015cross, krstajic2014cross]. This is achieved by proposing a novel dynamic fitness function, which adjusts the fitness calculation at each generation in an effort to prevent overfitting to any one static function. The fitness of an individual is then measured as the individual's average performance throughout its existence (i.e. averaged over the individual's lifetime). The proposed approach does so at far lower computational cost than typical repeated k-fold CV, by utilising the generational mechanism of evolutionary learning.

From an evolutionary learning perspective, the proposed approach can be seen as a form of regularisation, which prefers younger individuals throughout the evolutionary process. From a statistical perspective, the dynamic fitness function can be seen as improving the robustness of the approximation of the true testing performance (i.e. improved generalisation).

The motivation for this work is that current approaches to automated machine learning risk overfitting due to iterative improvement over a fixed fitness function. The goal of automated machine learning is to improve the unseen/generalisation performance of a pipeline, so mitigating this overfitting is extremely important.

The main contribution of this work is a novel idea of fitness, which serves as a regularisation technique while aiming to approximate a repeated k-fold CV, and thus helps improve generalisation performance. We experimentally show this to be useful in automated machine learning, but the usefulness also holds for many evolutionary computation (EC) techniques which repeatedly optimise a fixed fitness function in an attempt to improve the unseen performance, particularly in large search spaces plagued with local optima. The proposed extension is simple to implement and can serve as a nearly computationally free improvement to many existing EC methods.

The remainder of the paper is organised as follows: Section 2 provides an overview of related AutoML works, Section 3 outlines the newly proposed method, Section 4 compares the new fitness function to the current baseline, Section 5 analyses these differences in-depth, and Section 6 provides the conclusions and outlines future work.

2 Background and Related Work

Automated Machine Learning is a relatively new research area, which essentially uses machine learning to perform machine learning [evans2019population]. The cyclic definition may be confusing, but the idea is simple: automate the creation of machine learning pipelines by treating their construction as an optimisation problem. The goal is to replace the difficult process of selecting an appropriate pipeline with an automatic approach, where all the user needs to do is specify a dataset and an amount of time to train for, and an appropriate pipeline is returned automatically.

The current top-performing approaches to AutoML are based on EC methods (e.g. TPOT [OlsonGECCO2016]), or Bayesian optimisation (e.g. auto-sklearn [NIPS2015_5872], auto-weka [kotthoff2017auto, thornton2013auto]). Research has found no significant difference in the performance of such methods [amlb2019], and as such the aforementioned methods can all be considered the current state-of-the-art approaches. Here we focus on the EC methods, due to the population mechanism which allows for the proposed expansions without drastically increasing computational costs.

Evolutionary Computation (EC) is an area of nature-inspired techniques that approximate a global search. The search space is effectively (but not exhaustively) explored and exploited using a combination of mutation and crossover operators. Mutation operators randomly modify an individual, while crossover operators combine individuals to produce offspring (children). In this sense, EC techniques can be considered a guided population-based extension of random search [kuncheva1998nearest].
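As a toy illustration of this guided loop (not from the paper: the fitness function, operators, and parameters below are all invented for demonstration), selection, crossover, and mutation can be sketched as:

```python
import random

def fitness(x):
    return -(x - 3.0) ** 2  # toy objective to maximise, peak at x = 3

def mutate(x, rng):
    return x + rng.gauss(0, 0.5)  # randomly modify an individual

def crossover(a, b, rng):
    w = rng.random()              # combine two parents into one child
    return w * a + (1 - w) * b

def evolve(generations=50, pop_size=20, seed=0):
    rng = random.Random(seed)
    pop = [rng.uniform(-10, 10) for _ in range(pop_size)]
    for _ in range(generations):
        # keep the better half (selection), breed the rest from it
        parents = sorted(pop, key=fitness, reverse=True)[:pop_size // 2]
        children = [mutate(crossover(rng.choice(parents),
                                     rng.choice(parents), rng), rng)
                    for _ in range(pop_size // 2)]
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
```

The population concentrates around the optimum far faster than the uniform random sampling it starts from, which is the "guided" aspect exploited throughout this paper.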

TPOT [OlsonGECCO2016, Olson2016EvoBio, le2019scaling] is an example of an EC technique based on Genetic Programming, and represents the current state-of-the-art EC-based approach to AutoML. Machine learning pipelines are represented as tree structures, where the root node of the tree is an estimator and all other nodes are preprocessing steps. Preprocessing steps can perform transformations such as feature selection, principal component analysis, scaling, feature construction, etc. Fitness is measured by two objectives: the score and the complexity. The score is maximised, and the complexity is minimised, using NSGA-II. Here, our main focus is on improving the score, so for simplicity we refer to objective 1 alone as the fitness, but note that both objectives are still optimised in Section 3, and we analyse the pipeline size in more depth in Section 5.

An area related to AutoML is neural architecture search (NAS), which is essentially AutoML for deep neural networks.

The authors of [real2019regularized] propose an EC technique for NAS which uses a novel approach to measuring fitness. Rather than fitness being measured as the performance or loss of a neural network, the fitness is simply a measure of age: the younger an individual, the fitter it is considered. As a result, newer models are favoured in the evolutionary process. This is referred to as "ageing evolution" or "regularised evolution". This may work well for neural networks, since a good random initialisation can result in good performance "by chance". What is more important, however, are neural networks that retrain well (removing the luck of random initialisation). In this sense, with ageing evolution, only models which retrain well can persist.

With general AutoML systems, many of the components are deterministic, or at least not overly sensitive to randomisation (e.g. Random Forests), so the ageing component is less directly important. However, the idea of retraining well remains an important consideration. For example, the fitness is computed as the average performance over an internal k-fold CV on the training set, yet a pipeline may perform well on this particular set of folds but not necessarily on another random set of folds or, more importantly, the unseen test set. We adopt this idea of "retraining well" by introducing a form of repeated k-fold CV, and introduce a novel concept loosely based on age. This is described in more detail in Section 3.

It is important to mention the goal here is not to compare neural approaches (i.e. NAS) with more general classification pipelines (i.e. AutoML), but rather to improve an existing approach to general classification pipelines.

In this work, we look at adopting the ideas of regularisation and a dynamic fitness function, implementing these in a current state-of-the-art AutoML system, TPOT, to investigate whether they can improve its performance.

3 Proposed Approach

Figure 1: A comparison of the newly proposed fitness calculation (a.) vs the standard fitness calculation (b.). In the standard calculation (b.), the fitness is measured once and the function is fixed across all generations (the average internal test accuracy from a single k-fold split). In the proposed approach (a.), the fitness is dynamic and changes throughout an individual's lifetime. The fitness is then the average performance throughout an individual's lifetime (i.e. an approximation to repeated k-fold CV).

We propose a new method where the fitness function is dynamic and the performance is averaged over the lifetime of an individual. This is shown in Fig. 1. From this, we can see that at each generation, the fitness of an individual can change. This is in contrast to the typical approach, where an individual has a fixed fitness value.

For an individual to be competitive, it must therefore have performed well throughout its existence. Younger (newer) individuals have a higher chance of survival, as their performance has been less thoroughly evaluated than that of their predecessors (fewer repetitions).

This is based on the assumption that individuals created from crossover or mutation of well-performing individuals are more likely to perform well than a randomly generated individual. This is a fair assumption, as evolutionary computation in general is based on this idea. If the assumption did not hold, we would be better off performing a random search at every generation and keeping only the best individuals.

The result is that an individual created randomly (or a close descendant of a random individual) will be more thoroughly evaluated over the entire evolutionary process than an individual existing later in the process. An individual which is generated late in the evolution requires fewer evaluations, as it is the offspring or mutation of an individual which has already performed well on these previous evaluations.

There are two ways to think about this process, one from an evolutionary perspective, and one from a statistical perspective. These are examined in the following sections.

3.1 Evolutionary Perspective

From an evolutionary standpoint, individuals have a lifespan (maximum age), and this lifespan is based on the performance of the individual. If an individual performs well throughout its life, then its lifespan is long and it persists through generations. However, if an individual performs poorly at some (or all) stages of its life, then it dies out and is unable to keep spreading its genes into later generations. In this sense, the lifespan is dynamic and changes throughout an individual's life based on how it performs.

3.2 Statistical Perspective

For a given dataset D, the data is first split into a training set D_train and a test set D_test. D_train is then given to the AutoML process, and D_test is not seen until after the learning/optimisation has finished. From the training set, an internal CV is performed. D_train is split into k equally sized folds F = {f_1, ..., f_k}, where the distribution of class values is proportionate in each fold (i.e. stratified). Each fold is then used as an internal testing set exactly once (note: this is not D_test, it is a synthetic test set made from D_train), with the remaining k-1 folds becoming the internal training set. The performance of an individual is then measured as the mean performance (in this case f1-score, discussed in Section 4.1) across the k folds, which we represent as S_F. S_F measures how well an individual performs on the given folds F, and is used as an estimate of how the individual will perform on D_test, i.e. an estimate of S_test. We refer to this process as single k-fold CV.
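The single k-fold procedure can be sketched as follows (a simplified stand-in: stratification is omitted for brevity, and `score` is a hypothetical placeholder for training a pipeline on the k-1 training folds and scoring it on the held-out fold):

```python
import random
from statistics import mean

def make_folds(indices, k, seed):
    # the seed fixes the particular fold split F
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def single_k_fold(individual, indices, score, k=10, seed=0):
    folds = make_folds(indices, k, seed)
    per_fold = []
    for i, held_out in enumerate(folds):
        # every fold is the internal test set exactly once
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        per_fold.append(score(individual, train, held_out))
    return mean(per_fold)  # S_F, the estimate of S_test

# toy usage with a constant scoring stand-in
s_f = single_k_fold("pipeline", range(100), lambda ind, tr, te: 0.8)
```

A fixed seed gives a fixed split F; this is precisely what the static fitness function iterates against, and what the proposed approach varies.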

With TPOT (and other AutoML systems), the goal becomes to optimise S_F. This is achieved with selection that ranks individuals based on S_F. The problem is that optimising S_F does not necessarily optimise S_test (the classical definition of overfitting), as S_F often has a high variance. Although S_F is itself an average that helps mitigate overfitting to any single fold, we still risk overfitting to the specific folds F, since we iteratively try to improve on the exact folds used (over potentially hundreds or thousands of generations). That is, the maximum achieved S_F increases monotonically throughout evolution, without necessarily resulting in an increase in S_test. This can be seen easily if we consider the selection of folds to be noisy or unrepresentative of the data seen in D_test.

The main approach to fixing this in the model evaluation literature is repeated k-fold cross-validation, in an effort to reduce the variance and improve the stability of single k-fold. The authors of [zhang2015cross] suggest repeated k-fold over single k-fold if the primary goal is "prediction error estimation"; a similar sentiment is shared in [krstajic2014cross], which concludes that the "selection and assessment of predictive models require repeated cross-validation". These concerns become even more important when we iteratively improve on a single k-fold, as is done in AutoML, since the risk increases with each generation.

To integrate this repeated cross-validation into AutoML, at each fitness evaluation, rather than performing single k-fold CV, we would perform rk-fold CV, where r is the repeating factor (the k-fold CV is repeated r times with different fold splits). However, with AutoML, function evaluations are already expensive (training and evaluating a model on each fold), and repeating this for every individual would increase computation by a factor of r. It also requires deciding on a value for r: a value too high means unnecessary computation per individual, and thus less time for the guided evolutionary search for better-performing individuals, while with a value too low we risk the overfitting discussed above for single k-fold.

Instead, we propose a repetition method which integrates nicely with EC techniques, where r does not need to be specified and which has a lower computational cost than typical rk-fold. At each generation, a new repetition is performed (a new selection of folds F) for the population, and the performance is averaged over the lifetime of each individual.
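A minimal sketch of this bookkeeping (names are illustrative; `noisy_k_fold` stands in for one k-fold CV run whose fold split is seeded by the generation number):

```python
import random
from statistics import mean

def noisy_k_fold(individual, seed):
    # stand-in: the individual's true score plus fold-split noise
    rng = random.Random(f"{individual}-{seed}")
    return 0.70 + rng.uniform(-0.05, 0.05)

lifetime_scores = {}  # individual -> all scores received while alive

def evaluate(population, generation):
    # one new repetition per generation: every living individual is
    # re-scored on the fresh fold split
    for ind in population:
        s = noisy_k_fold(ind, seed=generation)
        lifetime_scores.setdefault(ind, []).append(s)

def fitness(ind):
    return mean(lifetime_scores[ind])  # average over the lifetime

for gen in range(5):              # an individual surviving 5 generations
    evaluate(["pipeline_a"], gen)
f = fitness("pipeline_a")
```

An individual alive for five generations has effectively undergone five repetitions of k-fold CV, while a freshly created individual has only one: repetitions accumulate for free as a by-product of survival.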

3.3 Novel Fitness Function

The key contribution of this work is a novel fitness function. A flow chart is given in Fig. 1 which shows the new calculation (a.) compared to the original approach (b.).

Mathematically, the performance of an individual is given in Eq. 1, where c represents a particular class, C the set of all classes, n_c the number of instances in class c, and n the total number of instances. This is the weighted f1-score, although the fitness calculation is independent of the particular performance (or scoring) function used.

    score = sum over c in C of (n_c / n) * f1_c    (1)

The fitness is then measured as the average performance over the lifetime of an individual, as shown in Eq. 2, where g is the current generation, b the generation in which the individual was created, and S_{F_i} the mean score (Eq. 1) across the fold split F_i used at generation i. Note that there are two objectives; the second objective (complexity) remains the same as the original measure (the number of components in the pipeline).

    fitness = (1 / (g - b + 1)) * sum over i from b to g of S_{F_i}    (2)
Since a Pareto front of solutions is maintained throughout evolution, this frontier must be cleared at each generation to avoid retaining individuals which happened to perform well at only a single point in time (and not in general). Simplified pseudo-code is given in Algorithm 1. The model chosen from the population is the one in the frontier with the highest objective 1 score at the end of evolution.

def evaluate(individuals, seed):
    for ind in individuals:
        score = k_fold(ind, training_data, seed)
        if ind.scores is undefined:
            ind.scores = []
        ind.scores.append(score)
        ind.fitness = (mean(ind.scores), length(ind))

def evolve(population_size):
    population = [random_individual() for _ in range(population_size)]
    evaluate(population, seed=0)
    for gen in range(1, generations + 1):
        offspring = apply_genetic_operators(population)
        # survivors are re-evaluated on the new fold split (seed = gen),
        # so their lifetime average keeps accumulating repetitions
        evaluate(population + offspring, seed=gen)
        population = NSGA_II(population + offspring, population_size)
        pareto_front = frontier(population)
    model = max(pareto_front, key=objective_1)

Algorithm 1: Pseudo-code for the proposed algorithm

The function set, terminal set, and evolutionary parameters all remain the same as in original TPOT, with a full description given in [OlsonGECCO2016]. For this reason, these are not expanded here.

3.3.1 Computational Cost

For the single k-fold (default), the total number of models trained is g * p * k, where g is the number of generations, p the population size, and k the number of folds.

For repeated k-fold, r * g * p * k model trainings would be performed.

For the proposed approach, g * (p + o) * k evaluations are performed, where o is the offspring size. This can be rewritten as 2 * g * p * k, since by default o = p. We can see that for r > 2, the proposed approach becomes more efficient in terms of the number of model evaluations for a given number of generations and population size. In the case where r = 2, as long as some individuals do not survive the entire evolutionary process, the proposed method requires fewer evaluations.

This reduction in computation comes from the fact that individuals are only evaluated throughout their lifetime, and not for the generations before and after they were alive. For example, if there are g total generations, and an individual is created in generation i and dies in generation j, then its k-fold is only repeated j - i + 1 times, not g (or r) times.

Therefore, the proposed method is both more computationally feasible (for any r > 2, in terms of the total number of folds evaluated), and removes the need to specify a value for r which may potentially waste computational time.
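These counts can be checked with a small calculation, assuming g generations, population size p, k folds, r repetitions, and (as in default TPOT) an offspring size equal to p:

```python
# Evaluation counts in units of single model trainings.

def single_k_fold_evals(g, p, k):
    return g * p * k

def repeated_k_fold_evals(g, p, k, r):
    return r * g * p * k

def proposed_evals(g, p, k):
    # population + offspring (= 2p individuals) are each scored on one
    # fresh k-fold split per generation
    return g * 2 * p * k

g, p, k = 100, 100, 10
counts = (single_k_fold_evals(g, p, k),
          proposed_evals(g, p, k),
          repeated_k_fold_evals(g, p, k, r=5))
```

For any r > 2 the proposed count sits strictly between single and repeated k-fold, independent of g, p, and k.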

3.3.2 Regularisation

This proposed idea, where behaviour is averaged over a lifetime, can be seen as a form of regularisation. The authors of [real2019regularized] define regularisation in the broader sense as "additional information that prevents overfitting to training noise". In this sense, averaging performance over the lifetime of an individual can be seen as a type of regularisation, where the additional information comes from the randomised/dynamic nature of the fitness function. The regularisation effect occurs because for an individual to be selected it must either a.) perform well across random repeated CV, or b.) be a modification of an individual which itself performed well across random repeated CV. This removes (or at least mitigates) the risk, present with the original (static) fitness function, of an individual only performing well on the specific set of folds used throughout the entire evolutionary process.

This is visualised in Fig. 2. From this figure, we can see we risk selecting a particular model only because it performs well on a specific set of randomly chosen folds (i.e. for a given seed for the k-fold CV), and not in general. For example, in 13 of the 30 cases (with r = 30), Fig. 2 (b) would have a higher fitness; in the other 17 cases, Fig. 2 (a) would have a higher fitness. Taking the average over all repetitions helps to prevent the selection of a model that has overfit too closely to a given set of folds, and thus serves to regularise the selection.
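This effect is easy to reproduce in simulation (the score distributions below are invented for illustration, not the paper's data): model A is genuinely better, yet fold-split noise lets model B win on a sizeable fraction of individual seeds, while the average over many repetitions recovers A:

```python
import random
from statistics import mean

def k_fold_score(true_skill, model_id, seed):
    # one k-fold CV run: the model's true skill plus fold-split noise
    rng = random.Random(f"{model_id}-{seed}")
    return true_skill + rng.gauss(0, 1.5)

seeds = range(100)
a_scores = [k_fold_score(85.0, "a", s) for s in seeds]  # truly better
b_scores = [k_fold_score(84.0, "b", s) for s in seeds]

b_wins = sum(b > a for a, b in zip(a_scores, b_scores))
average_picks_a = mean(a_scores) > mean(b_scores)
```

Selecting on any single seed therefore risks picking the worse model; selecting on the repetition average does not.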

Figure 2: A visual overview of the effect repeated k-fold can have on the fitness. The x-axis represents various runs of k-fold CV (with different seeds). Grey lines represent the results for particular seeds. The blue line represents the average over all seeds (i.e. rk-fold). Red asterisks indicate which of the two models performs best for a given seed. This particular scenario was constructed for demonstration purposes.

4 Comparisons

4.1 Setup

For comparisons, we begin with the 42 datasets chosen in [amlb2019] for AutoML benchmarking. However, we find many of these do not generate results within the allowed computational budget. Datasets which had not completed at least two generations before the time limit was reached were excluded, leaving 28 datasets. These were excluded because the effects of the dynamic fitness function would not be evident after only a single generation (it would then behave the same as a fixed fitness function). We use the most recent version of TPOT (#8b71687) as the baseline, and compare it to the proposed method, which is the same version of TPOT with the updated fitness function. The comparisons are all run on equivalent hardware, using 2 cores and the specified amount of training time (1 hour, 3 hours, or 6 hours). All code is written and run in Python 3.7.

As we are interested in the effect of the new fitness function alone (and not different optimisation methods, search spaces, etc.), we only compare to TPOT. For example, comparing to Auto-WEKA we would be comparing entirely different search spaces; likewise, comparing to NAS approaches we would be comparing neural networks vs "traditional" classification algorithms; and comparing to auto-sklearn we would be comparing different optimisers (EC vs Bayesian). There have also already been several studies comparing the various AutoML algorithms [amlb2019, truong2019towards, guyon2019analysis], so the goal is not to compare these algorithms again, or to propose an entirely new method, but rather to investigate the usefulness of a dynamic fitness function in TPOT. For these reasons, the only variation between the two methods is the fitness function, ensuring any differences in performance are a direct result of the new fitness function.

All parameters are fixed to their default values. The exception is the scoring function, which is accuracy by default. Here, we use the weighted f1-score for both methods instead, as we cannot assume equal class distributions as is done with accuracy. Again, we reiterate that the proposed method is robust to the selection of the scoring function, and swapping out the scoring function in the fitness calculation is trivial.
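For reference, the weighted f1-score weights each class's f1 by its support. A from-scratch sketch (equivalent in intent to scikit-learn's f1_score with average='weighted'):

```python
def weighted_f1(y_true, y_pred):
    n = len(y_true)
    total = 0.0
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        support = sum(t == c for t in y_true)  # n_c instances in class c
        total += (support / n) * f1            # weight f1 by class support
    return total

score = weighted_f1([0, 0, 1, 1, 1], [0, 1, 1, 1, 0])
```

Unlike plain accuracy, the class weighting means a majority-class-only predictor no longer looks artificially strong on imbalanced data.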

The underlying search spaces and parameters are therefore equal for both methods. TPOT uses the static (default) fitness, whereas the proposed uses the new fitness described in Section 3.

5x2-cv is used to generate the results, which are presented as mean ± standard deviation. General significance testing is performed using the Wilcoxon signed-rank test [wilcoxon1945], with alpha = 0.05, pairing each dataset between the two methods as suggested in [demvsar2006statistical]. We do not provide per-dataset significance testing, as we are interested in the general performance of the proposed method, and the increased likelihood of false results makes such per-dataset comparisons heavily doubted for general conclusions [dietterich1998approximate]. Likewise, when discussing wins/losses/draws we do not count significance, as "counting only significant wins and losses does not make the tests more but rather less reliable, since it draws an arbitrary threshold of p < 0.05 between what counts and what does not" [dietterich1998approximate]. As we are performing three tests, the p-values are also adjusted with the Bonferroni correction (i.e. multiplied by 3).

In this section, we look at what impact this new idea of fitness can have on the resulting pipelines, with all other factors fixed.

4.2 Results

We run each method for 1 hour, 3 hours, and 6 hours. Doing so ensures the results do not only hold at a specific point in time, and also allows us to consider whether there are any effects over time.

1 Hour 3 Hours 6 Hours
Proposed TPOT Proposed TPOT Proposed TPOT
adult 86.65±0.20 86.71±0.29 86.68±0.26 86.73±0.23 86.71±0.18 86.68±0.32
anneal 98.98±0.66 98.44±0.68 99.02±0.48 98.89±0.74 98.41±0.89 98.92±0.40
apsfailure 99.34±0.04 99.34±0.04
arrhythmia 69.55±3.52 68.84±3.38 68.98±2.71 69.15±3.04 69.99±2.72 69.84±3.86
australian 86.11±1.05 86.15±1.17 86.29±1.19 85.95±0.95 86.43±1.30 85.47±0.96
bank-marketing 90.06±0.17 86.72±0.37 90.16±0.30 86.73±0.31 90.20±0.21 87.19±0.54
blood 76.79±1.81 72.60±3.13 76.76±1.89 74.15±3.59 76.84±1.76 73.27±4.08
car 98.34±0.82 93.18±2.83 98.64±0.62 93.22±3.28 98.27±0.72 93.14±1.87
cnae-9 94.41±0.50 94.48±0.62 94.39±0.80 94.44±0.91 94.99±0.40 94.37±0.63
connect-4 84.24±0.36 72.44±1.69 84.05±0.39 71.77±0.33
credit-g 73.02±1.53 73.02±1.84 72.76±1.46 72.85±1.70 73.59±1.44 72.56±1.72
dilbert 96.04±0.57 96.36±0.55
helena 30.07±0.76 30.00±0.71
higgs 72.03±0.39 72.00±0.32 72.15±0.21 72.10±0.30
jannis 70.01±0.51 70.41±0.50
jasmine 80.06±1.38 80.14±0.96 80.82±0.97 80.72±0.52 81.36±0.64 81.37±0.63
jungle 88.14±1.31 83.84±0.87 90.41±1.57 84.21±1.01 93.04±1.92 85.07±1.55
kc1 83.36±1.50 82.30±1.05 83.68±1.24 82.39±0.76 83.53±1.64 82.37±1.31
kr-vs-kp 99.31±0.15 99.24±0.29 99.41±0.22 98.82±0.79 99.40±0.19 99.31±0.16
mfeat-factors 97.11±0.55 96.75±0.45 97.55±0.28 97.07±0.39 97.45±0.21 97.46±0.54
miniboone 94.21±0.07 94.23±0.07 94.24±0.08 94.27±0.08
nomao 96.67±0.18 96.61±0.19 96.79±0.10 96.60±0.18
numerai 51.73±0.16 51.70±0.26 51.73±0.13 51.78±0.14
phoneme 89.39±0.35 89.26±0.61 89.61±0.41 89.55±0.65 89.65±0.35 89.58±0.65
segment 92.93±1.09 92.87±0.79 93.34±0.94 92.83±0.95 93.21±0.65 92.84±0.62
shuttle 99.95±0.03 99.97±0.01 99.97±0.01 99.97±0.01 99.97±0.01 99.97±0.01
sylvine 94.86±0.33 94.78±0.38 95.31±0.40 95.22±0.47 95.63±0.59 95.42±0.65
vehicle 81.04±2.58 80.13±1.99 80.58±1.36 80.68±2.22 81.17±1.62 80.68±2.29
Significance p = 0.01455 p = 0.01173 p = 0.01068
Table 1: Average weighted f1-scores, scaled to [0, 100] for readability. Presented as mean ± standard deviation from 5x2 cv. A blank (grey) cell indicates the methods only had time to perform a single generation (or another problem occurred), so comparisons would be meaningless. The final row gives p-values from the Wilcoxon signed-rank test (as described in Section 4.1). Green indicates that the proposed method is significantly better than the baseline at alpha = 0.05.

From the results in Table 1, we can see that at 1 hour the proposed method's average score is better on 13 of the datasets, worse on 6 of the datasets, and on 7 datasets the results were not generated in time. In general, the proposed method performs significantly better under the paired Wilcoxon signed-rank test, as shown in Table 1. In the cases where the proposed method beats the original, it tends to do so by a much larger margin than when the original beats the proposed, which is reflected in the very small p-values of the statistical test.

We can see similar results when considering the 3-hour runs. On 17 of the datasets, the proposed method has a higher average score than the original. On 9 of the datasets, the proposed method has a lower average score than the original. Again, we can see that in general, when viewing the significance tests in Table 1, the proposed method does significantly better than the baseline.

Again, similar results are also seen at the 6-hour point. The proposed method had a higher average score on 19 of the datasets, and a lower average score on 9 of the datasets.

From this, we can conclude that the proposed fitness function provides a significant improvement over the single k-fold fitness, and this is reflected across all of the time points trialled. There is no reason to believe the patterns would differ at longer time frames; in fact, the proposed method should, in theory, perform better as time goes on (less overfitting).

5 Further Analysis

In this section, we analyse the results from Section 4 in more depth. We consider some of the underlying characteristics, to understand what effect the new fitness function has on resulting models. For this, we use the result of the 6-hour run from the trials in Section 4 as this gives the largest set of results (more datasets) and allows us to potentially find trends over a longer period of time.

Age Generations Difference Complexity
Proposed TPOT Proposed TPOT Proposed TPOT Proposed TPOT
adult 0±0 2±1 5±4 6±3 0.32±0.13 0.36±0.23 1.50±0.67 2.00±1.26
anneal 1±0 79±55 119±42 129±47 1.00±0.58 0.54±0.33 2.57±1.59 2.12±0.60
apsfailure 1±0 2±1 2±0 2±0 0.04±0.03 0.04±0.02 1.38±0.70 1.40±0.92
arrhythmia 1±0 5±3 24±8 17±6 5.36±4.23 6.02±4.60 3.10±1.14 2.20±1.25
australian 1±0 322±189 312±115 451±201 2.96±1.86 5.33±1.50 4.70±1.95 3.80±1.89
bank-marketing 1±0 4±4 9±2 30±11 0.23±0.25 0.43±0.45 2.70±0.78 4.70±1.00
blood 32±95 421±312 486±112 757±315 3.10±2.34 7.50±4.54 4.00±1.41 4.20±1.99
car 1±0 55±47 62±18 189±82 0.72±0.41 4.29±2.81 4.80±1.40 5.70±1.62
cnae-9 3±9 4±2 29±16 15±8 0.62±0.41 1.26±1.01 3.60±1.28 3.20±1.40
connect-4 0±0 2±1 3±1 3±1 0.52±0.36 4.88±0.55 1.40±0.49 1.90±0.30
credit-g 25±74 117±103 239±138 291±117 3.36±2.41 6.62±2.65 4.10±1.22 5.50±1.96
dilbert 1±0 2±1 2±0 2±0 0.89±0.34 0.71±0.21 1.60±0.66 1.10±0.30
helena 1±0 1±0 1±0 1±0 0.59±0.35 0.69±0.27 1.57±0.73 1.43±0.49
higgs 1±0 2±1 3±0 3±1 0.37±0.23 0.40±0.27 1.90±0.83 1.80±0.75
jannis 0±0 1±1 1±0 1±0 0.35±0.18 0.32±0.32 1.70±1.00 1.80±0.40
jasmine 1±2 5±5 22±6 26±8 1.35±0.59 1.26±0.91 3.00±0.63 3.40±0.66
jungle 1±0 3±3 17±2 15±1 0.88±0.33 8.36±1.57 3.10±0.83 3.00±1.34
kc1 0±0 113±114 136±48 269±115 2.30±1.67 2.72±1.66 4.90±2.70 4.50±1.20
kr-vs-kp 0±0 20±45 44±32 46±52 0.26±0.15 0.91±0.76 3.30±1.27 4.20±1.17
mfeat-factors 1±0 3±2 14±1 15±5 0.69±0.40 0.55±0.59 2.67±0.67 3.20±1.33
miniboone 0±0 2±1 2±1 3±0 0.07±0.04 0.09±0.06 1.80±1.08 1.30±0.46
nomao 0±0 2±1 6±2 7±2 0.18±0.11 1.92±0.29 2.11±0.74 2.50±0.67
numerai 1±0 2±2 7±2 7±1 0.39±0.15 0.38±0.20 3.70±0.90 2.90±0.83
phoneme 1±0 35±28 131±29 139±50 0.54±0.49 0.92±0.72 4.30±1.27 4.40±2.01
segment 1±0 41±47 90±42 94±49 1.09±0.93 1.52±1.05 3.50±1.63 3.40±1.50
shuttle 0±0 2±2 14±1 15±2 0.01±0.01 0.01±0.01 2.90±1.45 2.70±0.90
sylvine 1±0 13±12 74±15 76±16 0.62±0.34 0.63±0.43 5.50±1.43 4.40±1.28
vehicle 1±0 61±136 101±52 132±130 3.81±2.37 4.88±2.29 4.33±1.49 5.60±1.43
Significance p≈0 p=0.00313 p=0.00105 p=0.69008
Table 2: An analysis of resulting model characteristics, presented as mean ± standard deviation. Full descriptions for each column are given in Section 5. The main conclusions are that the proposed method results in far younger best individuals on average, and in internal scores closer to the true testing score. The number of generations was significantly lower than the original, but this was expected due to the extra cost of repeated CV. General significance testing is performed in the final row. Green indicates significant at alpha = 0.05. Age and generations are rounded to the nearest integer for presentation, but not for significance testing.

5.1 Age and Generations

The age of an individual often receives little consideration in EC algorithms, in favour of just analysing fitness (or performance). However, there is some existing research into age. For example, [hornby2006alps] show that an age-layered population (which regularly replaces the oldest models with new randomised ones) can help to avoid local optima by promoting diversity in the population. [real2019regularized] also make interesting discoveries when using age alone as the measure of fitness, rather than performance. They found improved results due to the implicit regularisation of individuals, as only individuals that retrain well persist in the population.

It is clear that age can be a useful characteristic for helping to improve performance of EC techniques, and one of the ideas behind the proposed fitness function (average performance over lifetime) is that it will become increasingly difficult for an older individual to exist throughout generations, which also serves to diversify the population by ”clearing” out older individuals.

Looking at Table 2, we can see that this is, in fact, the case: the age of the best resulting individuals from the proposed method is often either 0 (i.e. generated in the final generation) or 1 (generated in the second-to-last generation). The notable exceptions are the blood and credit-g datasets. On the blood dataset, the average age was 32. However, the number of generations here was also the highest (486), and the age is still far lower than with the original method (198), meaning the individuals are still relatively young. Likewise, with the credit-g dataset, the average age is 25, but this is also far lower than the average age of 117 from the baseline. Compared to the baseline, the resulting models are far younger in general. While this is not necessarily useful on its own, the results in Table 1 show that this youthfulness has proven useful.

Figure 3: Relative Age

The relative age (percentage of generations) from the two methods is shown in Fig. 3.

Rather than analysing age directly (as is done in [hornby2006alps, real2019regularized]), the proposed approach makes it more difficult for older individuals to persist by averaging performance over the lifetime. Despite each approach using a different mechanism to bias towards younger individuals, the results here confirm the usefulness of age, which is consistent with the observations in both [hornby2006alps, real2019regularized].

5.2 Approximation of the Test Score (Difference)

The overall goal of the fitness function is to maximise the true test score. Of course, we cannot do this directly, so we use the internal validation score as a proxy for the test score. How good this proxy is is given in Table 2 as "Difference".

This is measured directly as the validation score minus the test score, quantifying how much "overfitting" is occurring. The ideal difference is thus 0, with the worst being 100. Of course, this is not a perfect measure: for instance, the test score could sit a fixed amount below the validation score, making the validation score a good proxy (perfectly correlated with the test score) despite a large difference. We assume this cannot occur, since both scoring functions are the same, just computed on different sets of data. However, there could also be more complex underlying relationships, which would not be captured by this difference measure.
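As a minimal illustration of the difference measure (assuming percentage f1-scores, and using a hypothetical hold-out split rather than the paper's exact experimental protocol):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data and model; the point is the validation-vs-test gap.
X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
# Internal validation score: mean 5-fold CV f1 on the training set (in %).
val = cross_val_score(clf, X_tr, y_tr, cv=5, scoring="f1").mean() * 100
# True test score: f1 on the held-out test set (in %).
test = f1_score(y_te, clf.fit(X_tr, y_tr).predict(X_te)) * 100

# "Difference": how far the validation proxy is from the test score.
print(f"validation={val:.1f}, test={test:.1f}, difference={val - test:.1f}")
```

A small difference means the internal validation score is a faithful proxy for unseen-data performance, which is what Table 2 reports per method.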

Therefore, this measure alone should be interpreted with caution, but when paired with the testing results in Table 1 we get a better understanding of the approximation. For example, we could have a perfect proxy (a difference of 0), but if the test score itself is very low then that is not ideal.

Pairing the approximation with the true testing accuracy in Table 1, we see both a closer approximation of the test score than the original method achieves and higher resulting testing accuracies (both statistically significant). This confirms that our approximation of repeated k-fold CV is useful for obtaining a less biased estimate of the true generalisation performance. This is very important since the goal of AutoML is to improve the testing performance indirectly by improving the validation performance (as we cannot directly optimise the testing performance), so achieving an unbiased estimate assists this goal.

5.3 Complexity

TPOT (and by extension the proposed method) already uses NSGA-II to balance the complexity of the pipelines against their performance, where the goal is to minimise complexity and maximise performance (f1-score).

In this work, we focus particularly on improving the performance of the pipelines, and as such all discussion up to this point has concerned classification performance. However, an important concern is that this improvement should not come at the expense of an increase in complexity.

Therefore, we analyse whether the new regularised evolution has any additional effect on the size of the pipelines. This is shown in the final columns of Table 2, which give the average size of the best resulting individual from each run of the proposed method and each run of the baseline method across every dataset. No statistically significant difference in size was found, which is reassuring. This means dynamically changing one objective while leaving the other fixed had no negative impact on the fixed objective. It also means good individuals appearing later in the evolution were no more likely to be larger than individuals appearing early, a pattern often seen in single-objective GP (see "bloat" [whigham2009implicitly]).

Figure 4: Dominance Plot. Each point represents the average result on a dataset. Purple points are the proposed method, and orange points are the baseline (original) method. Lines pair datasets between methods. A green line indicates the proposed method dominates the baseline. A red line indicates the baseline dominates the proposed. A grey line indicates no dominance, i.e. one method achieved better in objective 1 but the other method did better in objective 2.

To validate the claims above that no negative effect was seen on complexity, we also perform an additional comparison considering both objectives. In Fig. 4, we visualise a dominance plot. A method dominates another method on a particular dataset if at least one resulting objective (complexity or performance) is strictly better than the other method's corresponding objective, and all other objectives are at least as good as the other method's. We can see that on 11 datasets the proposed method dominates the baseline. On 5 datasets, the baseline dominates the proposed. On the remaining datasets, neither method dominates the other (i.e. one objective was better but the other was worse, a trade-off). Furthermore, the majority of the cases (4 out of 5) where the proposed method is dominated are the simpler problems (i.e. close to a perfect test performance with a complexity of 1), whereas on the more difficult problems the improvements become more apparent.
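The dominance relation used in Fig. 4 can be written down directly. A sketch, where each solution is a hypothetical (performance, complexity) pair averaged over a dataset:

```python
def dominates(a, b):
    """Return True if solution `a` dominates solution `b`.

    Each solution is a (f1_score, complexity) pair: f1 is maximised,
    complexity (pipeline size) is minimised. `a` dominates `b` if it is
    at least as good in both objectives and strictly better in one.
    """
    at_least_as_good = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return at_least_as_good and strictly_better


# Hypothetical averaged results on one dataset (illustrative numbers only).
proposed = (0.92, 3.5)  # higher f1, same complexity
baseline = (0.90, 3.5)

print(dominates(proposed, baseline))  # True: better f1, no worse complexity
print(dominates(baseline, proposed))  # False
```

When neither call returns True, the two methods trade off against each other, which corresponds to the grey lines in Fig. 4.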

As a result, we can conclude that no negative effect can be seen on the complexity (i.e. no increase), but a positive effect can be seen on the performance with the newly proposed fitness function, particularly on more complex datasets. The result is improved pipeline performance at no increase in pipeline complexity.

6 Conclusions and Future Work

In this work, we proposed a novel fitness function which can be used to improve the generalisation ability of AutoML by serving as an implicit form of regularisation. The fitness function is dynamic and changes at each generation; the fitness of an individual is then measured as its average over the individual's lifetime. We implemented this new fitness in place of the standard fitness evaluations in the current state-of-the-art AutoML method TPOT, and showed significant improvement over the standard (static) fitness function in general, on all time-equated comparisons.

The improvement in performance is due to the fact that the new fitness function approximates repeated k-fold CV, which helps prevent the overfitting that can occur from iterative improvement over a limited number of folds, while also avoiding the manual specification of a repetition factor. We empirically show this to work well for AutoML problems, but the proposed fitness function is general enough to be applied to any EC method with a static fitness function as a "free" improvement to generalisation, particularly in large search spaces plagued by local optima (such as AutoML).

For further work, there is already much research into model evaluation schemes [bengio2004no, krstajic2014cross, zhang2015cross, dietterich1998approximate, nadeau2000inference, bouckaert2004evaluating, demvsar2006statistical, vanwinckelen2012estimating]; however, a thorough analysis of their impact on AutoML has yet to be conducted and is beyond the scope of this paper. For example, should we be checking whether improvements are statistically significant at each generation? If not, are there better ways to improve our approximation of the generalisation performance for comparing these methods directly?

Another potential research direction is based on ensemble learning. Here, we average performance over the lifetime of an individual by altering the fitness function at each generation to encourage generalisation. An alternative approach could store the best individual from each generation (and thus the best individual for each split of the data), and then use these individuals as an ensemble. In this sense, a "free" ensemble could be constructed, but the final pipelines would be far more complex (an ensemble with size equal to the number of generations). Other methods could also be considered based on this idea, as an ensemble can be constructed easily thanks to the randomness in the fitness function (which indirectly creates diversity).
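The per-generation ensemble idea could be sketched as follows, with ordinary scikit-learn classifiers standing in for the best pipeline of each generation (the models, data, and majority-vote scheme are all illustrative assumptions, not the paper's method):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; in the envisioned setup each member would be the
# best pipeline from one generation, trained under that generation's split.
X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

per_generation_best = [
    LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
    DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr),
    KNeighborsClassifier().fit(X_tr, y_tr),
]

# Majority vote over the members' binary predictions (rows = members).
votes = np.stack([m.predict(X_te) for m in per_generation_best])
majority = (votes.mean(axis=0) > 0.5).astype(int)

print("ensemble accuracy:", (majority == y_te).mean())
```

Because each generation evaluates against a different split, the stored members would naturally be diverse, which is the property ensembles rely on; the cost is the final model growing with the number of generations.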