1 Introducing MLaut
MLaut is a modelling and workflow toolbox in Python, written with the aim of simplifying large scale benchmarking of machine learning strategies, e.g., validation, evaluation and comparison with respect to predictive/task-specific performance or runtime. Key features are:
automation of the most common workflows for benchmarking modelling strategies on multiple datasets including statistical post-hoc analyses, with user-friendly default settings
unified interface with support for scikit-learn strategies, keras deep neural network architectures, including easy user extensibility to (partially or completely) custom strategies
higher-level meta-data interface for strategies, allowing easy specification of scikit-learn pipelines and keras deep network architectures, with user-friendly (sensible) default configurations
easy setting up and loading of data set collections for local use (e.g., data frames from local memory, UCI repository, openML, Delgado study, PMLB)
back-end agnostic, automated local file system management of datasets, fitted models, predictions, and results, with the ability to easily resume crashed benchmark experiments with long running times
MLaut may be obtained from PyPI via pip install mlaut, and is maintained on GitHub at github.com/alan-turing-institute/mlaut. A Docker implementation of the package is available on Docker Hub via docker pull kazakovv/mlaut.
Note of caution: time series and correlated/associated data samples
MLaut implements benchmarking functionality which provides statistical guarantees under assumption of either independent data samples, independent data sets, or both. This is mirrored in Section 2.3 by the crucial mathematical assumptions of statistical independence (i.i.d. samples), and is further expanded upon in Section 2.4.
In particular, it should be noted that naive application of the validation methodology implemented in MLaut to samples of time series, or other correlated/associated/non-independent data samples (within or between datasets), will in general violate the validation methodologies’ assumptions, and may hence result in misleading or flawed conclusions about algorithmic performance.
The BSD license under which MLaut is distributed further explicitly excludes liability for any damages arising from use, non-use, or mis-use of MLaut (e.g., mis-application within, or in evaluation of, a time series based trading strategy).
1.1 State-of-art: modelling toolbox and workflow design
A hierarchy of modelling designs may tentatively be identified in contemporary machine learning and modelling ecosystems, such as the python data science environment and the R language:
Level 2: provision of a unified interface for methodology solving the same “task”, e.g., supervised learning aka predictive modelling. This is one core feature of the Weka, scikit-learn and Shogun projects, which all also implement level 1 functionality, and the main feature of the caret and mlr packages in R, which provide level 2 functionality by external interfacing of level 1 packages.
Level 3: composition and meta-learning interfaces such as tuning and pipeline building, more generally, first-order operations on modelling strategies. Packages implementing level 2 functionality usually (but not always) also implement this, such as the general hyper-parameter tuning and pipeline composition operations found in scikit-learn and mlr or its mlrCPO extension. Keras has abstract level 3 functionality specific to deep learning; Shogun possesses such functionality specific to kernel methods.
Level 4: workflow automation of higher-order tasks performed with level 3 interfaces, e.g., diagnostics, evaluation and comparison of pipeline strategies. mlr is, to our knowledge, the only existing modelling toolbox with a modular, class-based level 4 design that supports and automates re-sampling based model evaluation workflows. The Weka GUI and module design also provide some level 4 functionality.
A different type of level 4 functionality is automated model building, closely linked to but not identical with benchmarking and automated evaluation - similarly to how, mathematically, model selection is not identical with model evaluation. Level 4 interfaces for automated model building also tie into level 3 interfaces; examples of automated model building are implemented in auto-Weka, auto-sklearn, or extensions to mlrCPO.
In the Python data science environment, to our knowledge, there is currently no widely adopted solution with level 4 functionality for evaluation, comparison, and benchmarking workflows. The reasonably well-known skll package provides automation functionality in Python for scikit-learn based experiments, but follows an unencapsulated scripting design which limits extensibility and usability, especially since it is difficult to use with level 3 functionality from scikit-learn or state-of-art deep learning packages.
Smaller studies, focusing on a couple of estimators trained on a small number of datasets, have also been published. However, to the best of our knowledge: none of the authors released a toolbox for carrying out the experiments; code used in these studies cannot be directly applied to conduct other machine learning experiments; and deep neural networks were not included as part of the benchmark exercises.
At the current state-of-art, hence, there is a distinct need for level 4 functionality in the scikit-learn and keras ecosystems. Instead of re-creating the mlr interface or following a GUI-based philosophy such as Weka, we have decided to create a modular workflow environment which builds on the particular strengths of python as an object oriented programming language, the notebook-style user interaction philosophy of the python data science ecosystem, and the contemporary mathematical-statistical state-of-art with best practice recommendations for conducting formal benchmarking experiments - while attempting to learn from what we believe works well (or not so well) in mlr and Weka.
1.2 Scientific contributions
MLaut is more than a mere implementation of readily existing scientific ideas or methods. We argue that the following contributions, outlined in the manuscript, are scientific contributions closely linked to its creation:
design of a modular “level 4” software interface which supports the predictive model validation/comparison workflow, a data/model file input/output back-end, and an abstraction of post-hoc evaluation analyses, at the same time.
a comprehensive overview of the state-of-art in statistical strategy evaluation, comparison and comparative hypothesis testing on a collection of data sets. We further close gaps in said literature by formalizing and explicitly stating the kinds of guarantees the different analyses provide, and detailing computations of related confidence intervals.
as a principal test case for MLaut, we conducted a large-scale supervised classification study in order to benchmark the performance of a number of machine learning algorithms, with a key sub-question being whether more complex and/or costly algorithms tend to perform better on real-world datasets. On the representative collection of UCI benchmark datasets, kernel methods and random forests perform best.
as a specific but quite important sub-question we empirically investigated whether common off-the-shelf deep learning strategies would be worth considering as a default choice on the “average” (non-image, non-text) supervised learning dataset. The answer, somewhat surprising in its clarity, appears to be that they are not - in the sense that alternatives usually perform better. However, on the smaller tabular datasets, the computational cost of off-the-shelf deep learning architectures is also not as high as one might naively assume. This finding is also subject to a major caveat and future confirmation, as discussed in Section 5.4.3 and Section 5.6.4.
Literature relevant to these contributions will be discussed in the respective sections.
1.3 Overview: usage and functionality
We present a short written demo of core MLaut functionality and user interaction, designed to be convenient in combination with a Jupyter notebook or a scripting command line working style. Introductory Jupyter notebooks similar to the demo below may be found as part of MLaut’s documentation.
A prerequisite is setting up a database for the dataset collection, which has to happen only once per computer and dataset collection; here we assume that the collection has already been stored in a local MLaut HDF5 database. The first step in the core benchmarking workflow is then to define hooks to the database input and output files.
After the hooks are created we can proceed to preparing fixed re-sampling splits (training/test) on which all strategies are evaluated. By default MLaut creates a single evaluation split, with one uniformly sampled fraction of the data used for training and the remainder for testing.
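The idea of a single, fixed, uniformly sampled split that is shared by all strategies can be sketched as follows. This is a minimal illustration in plain Python and not MLaut's own interface: the function name `make_fixed_split`, the `test_fraction` value and the seed are hypothetical choices made for this example.

```python
import random


def make_fixed_split(n_samples, test_fraction=1 / 3, seed=0):
    """Return (train_idx, test_idx) for one uniformly sampled, reproducible split.

    The fraction and the seed are illustrative assumptions, not MLaut defaults.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must lie strictly between 0 and 1")
    indices = list(range(n_samples))
    # A seeded generator makes the uniform permutation reproducible.
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_samples * test_fraction))
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx


# The split is computed once per dataset and then re-used by every strategy,
# so that all strategies are compared on identical training and test points.
train_idx, test_idx = make_fixed_split(n_samples=12, test_fraction=1 / 3, seed=42)
```

Fixing the split before any strategy is fitted is what makes the per-dataset performances of different strategies directly comparable.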
For a simple set-up, a standard set of estimators that come with sensible parameter defaults can be initialized. Advanced commands allow the user to specify hyper-parameters, tuning strategies, keras deep learning architectures, scikit-learn pipelines, or even fully custom estimators.
The user can now proceed to running the experiments. Training, prediction and evaluation are separate; partial results, including fitted models and predictions, are stored and retrieved through database hooks. This allows intermediate analyses, and for the experiment to easily resume in case of a crash or interruption. If this happens, the user would simply need to re-run the code above and the experiment will continue from the last checkpoint, without re-executing prior costly computation.
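The resume-after-crash behaviour described above can be illustrated with a small checkpointing sketch. It uses JSON files in a directory rather than MLaut's HDF5 back-end, and all names (`run_experiment`, `toy_fit_and_predict`, the file naming scheme) are hypothetical; it only shows the principle that finished (dataset, strategy) pairs are detected on disk and skipped.

```python
import json
import tempfile
from pathlib import Path


def run_experiment(strategies, datasets, results_dir, fit_and_predict):
    """Run every (dataset, strategy) pair, skipping pairs already saved on disk."""
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    computed = []
    for dataset_name, data in datasets.items():
        for strategy_name, strategy in strategies.items():
            checkpoint = results_dir / f"{dataset_name}__{strategy_name}.json"
            if checkpoint.exists():
                continue  # finished in an earlier run: no costly re-computation
            predictions = fit_and_predict(strategy, data)
            # Write to a temporary file first, then rename, so that a crash
            # during writing never leaves a half-written checkpoint behind.
            tmp_path = checkpoint.with_suffix(".tmp")
            tmp_path.write_text(json.dumps(predictions))
            tmp_path.replace(checkpoint)
            computed.append((dataset_name, strategy_name))
    return computed


def toy_fit_and_predict(strategy, data):
    """Toy stand-in for training and prediction: one constant per test point."""
    return [strategy(data["train"]) for _ in data["test"]]


strategies = {"mean": lambda ys: sum(ys) / len(ys), "zero": lambda ys: 0.0}
datasets = {
    "d1": {"train": [1.0, 3.0], "test": [2.0]},
    "d2": {"train": [0.0, 4.0], "test": [1.0, 5.0]},
}

with tempfile.TemporaryDirectory() as tmp:
    first_run = run_experiment(strategies, datasets, tmp, toy_fit_and_predict)
    # Re-running the same code finds all checkpoints and computes nothing new.
    second_run = run_experiment(strategies, datasets, tmp, toy_fit_and_predict)
```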
The last step in the pipeline is executing post-hoc analyses for the benchmarking experiments. The AnalyseResults class allows the user to specify performance quantifiers to be computed and comparison tests to be carried out, based on the intermediate computation data, e.g., predictions from all the strategies.
The prediction_errors() method returns two sets of results: the errors_per_estimator dictionary, which is used subsequently in further statistical tests, and errors_per_dataset_per_estimator_df, which is a dataframe with the loss of each estimator on each dataset that can be examined directly by the user.
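To illustrate the two kinds of result structure, the sketch below computes a mean loss for every (estimator, dataset) pair and returns it in both an estimator-keyed and a dataset-keyed layout. It is not the MLaut implementation: the function `prediction_errors_sketch`, the nested-dictionary layout and the assumption that each estimator maps to a list of per-dataset mean losses are ours, chosen only for illustration.

```python
def prediction_errors_sketch(predictions, truths, loss):
    """Compute mean test losses in two layouts.

    predictions[estimator][dataset] is a list of predictions,
    truths[dataset] is the list of observed test labels (hypothetical layout).
    """
    errors_per_estimator = {}
    errors_per_dataset_per_estimator = {}
    for estimator, by_dataset in predictions.items():
        errors_per_estimator[estimator] = []
        for dataset, y_pred in by_dataset.items():
            y_true = truths[dataset]
            losses = [loss(p, t) for p, t in zip(y_pred, y_true)]
            mean_loss = sum(losses) / len(losses)
            # Layout 1: one list of per-dataset losses per estimator (test input).
            errors_per_estimator[estimator].append(mean_loss)
            # Layout 2: dataset-by-estimator table for direct inspection.
            errors_per_dataset_per_estimator.setdefault(dataset, {})[estimator] = mean_loss
    return errors_per_estimator, errors_per_dataset_per_estimator


def zero_one(y_pred, y_true):
    """Misclassification loss for a single prediction."""
    return 0.0 if y_pred == y_true else 1.0


truths = {"d1": [0, 1, 1, 0], "d2": [1, 1]}
predictions = {
    "A": {"d1": [0, 1, 0, 0], "d2": [1, 1]},
    "B": {"d1": [1, 1, 1, 1], "d2": [0, 1]},
}
per_estimator, per_dataset = prediction_errors_sketch(predictions, truths, zero_one)
```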
We can also use the produced errors to perform the statistical tests for method comparison, for example a t-test.
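As a hedged illustration of such a comparison, the following sketch computes the statistic of a paired t-test on the per-dataset errors of two strategies. The paired variant is an assumption made for this example; the text does not specify which t-test variant MLaut runs, and the function name `paired_t_statistic` is hypothetical. A p-value would be obtained by comparing the statistic against a t-distribution with the returned degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(errors_a, errors_b):
    """Return (t, degrees_of_freedom) for paired per-dataset errors."""
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need two equally long error lists with at least two datasets")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    std_dev = statistics.stdev(diffs)  # sample standard deviation of differences
    if std_dev == 0:
        raise ValueError("all differences are identical: the statistic is undefined")
    std_err = std_dev / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_err, len(diffs) - 1


# One error per dataset and strategy; the datasets are the paired units.
errors_a = [0.10, 0.20, 0.15, 0.30]
errors_b = [0.20, 0.25, 0.25, 0.35]
t_value, dof = paired_t_statistic(errors_a, errors_b)
```

A negative statistic here indicates that the first strategy has the lower average error across datasets.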
Data frames or graphs resulting from the analyses can then be exported, e.g., for presentation in a scientific report.
MLaut is part of VK’s PhD thesis project, the original idea being suggested by FK. MLaut and this manuscript were created by VK, under supervision by FK. The design of MLaut is by VK, with suggestions by FK. Sections 1, 2 and 3 were substantially edited by FK before publication, other sections received only minor edits (regarding content). The benchmark study of supervised machine learning strategies was conducted by VK.
We thank Bilal Mateen for critical reading of our manuscript, and especially for suggestions of how to improve readability of Section 2.4.
FK acknowledges support by The Alan Turing Institute under EPSRC grant EP/N510129/1.
2 Benchmarking supervised learning strategies on multiple datasets - generative setting
This section introduces the mathematical-statistical setting for the mlaut toolbox - supervised learning on multiple datasets. Once the setting is introduced, we are able to describe the suite of statistical benchmark post-hoc analyses that mlaut implements, in Section 3.
2.1 Informal workflow description
Informally, and non-quantitatively, the workflow implemented by mlaut is as follows: multiple prediction strategies are applied to multiple datasets, where each strategy is fitted to a training set and queried for predictions on a test set. From the test set predictions, performances are computed: performances by dataset, and also overall performances across all datasets, with suitable confidence intervals. For performance across all datasets, quantifiers of comparison (“is method A better than method B overall?”) are computed, in the form of statistical (frequentist) hypothesis tests, where p-values and effect sizes are reported.
The remainder of this Section 2 introduces the generative setting, i.e., statistical-mathematical formalism for the data sets and future situations for which performance guarantees are to be obtained. The reporting and quantification methodology implemented in the mlaut package is described in Section 3 in mathematical language; usage and implementation of these in the mlaut package is described in Section 4.
From a statistical perspective, it should be noted that only a single train/test split is performed for validation. This is partly due to simplicity of implementation, and partly due to the state-of-art’s incomplete understanding of how to obtain confidence intervals or variances for re-sampled performance estimates. Cross-validation strategies may be supported in future versions.
A reader may also wonder whether, even if there is only a single set of folds, there should not be three folds per split (or two nested splits), into tuning-train/tuning-test/test. (Footnote 1: What we call the “tuning-test fold” is often called a “validation fold”. We believe the latter terminology is misleading, since it is actually the final test fold which validates the strategy, not the second fold.) The answer is: yes, if tuning via re-sample split of the training set is performed. However, in line with current state-of-art understanding and interface design, tuning is considered as part of the prediction strategy. That is, the tuning-train/tuning-test split is strategy-intrinsic. Only the train/test split is extrinsic, and part of the evaluation workflow which mlaut implements; a potential tuning split is encapsulated in the strategy. This corresponds to state-of-art usage and understanding of the wrapper/composition formalism as implemented for example with GridSearchCV in sklearn.
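The encapsulation of tuning inside the strategy can be sketched in a few lines. The toy below mimics the wrapper pattern in plain Python rather than calling scikit-learn; the classes `ShrunkenMean` and `TunedStrategy` and the parameter grid are hypothetical. The point is that the evaluation workflow only ever calls `fit` on the training set and `predict` on the test set, while the tuning-train/tuning-test split happens inside `fit`.

```python
import random


def mean_squared_error(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


class ShrunkenMean:
    """Toy regressor: predicts the training mean, shrunk by 1 / (1 + lam)."""

    def __init__(self, lam):
        self.lam = lam

    def fit(self, X, y):
        self.value_ = (sum(y) / len(y)) / (1.0 + self.lam)
        return self

    def predict(self, X):
        return [self.value_ for _ in X]


class TunedStrategy:
    """Wrapper strategy: the tuning split is intrinsic and hidden in fit()."""

    def __init__(self, make_strategy, grid, seed=0):
        self.make_strategy = make_strategy
        self.grid = list(grid)
        self.seed = seed

    def fit(self, X, y):
        idx = list(range(len(y)))
        random.Random(self.seed).shuffle(idx)
        cut = len(idx) // 2
        inner_train, inner_test = idx[:cut], idx[cut:]

        def tuning_loss(param):
            model = self.make_strategy(param).fit(
                [X[i] for i in inner_train], [y[i] for i in inner_train]
            )
            preds = model.predict([X[i] for i in inner_test])
            return mean_squared_error(preds, [y[i] for i in inner_test])

        self.best_param_ = min(self.grid, key=tuning_loss)
        # Refit with the selected parameter on the full (outer) training set.
        self.model_ = self.make_strategy(self.best_param_).fit(X, y)
        return self

    def predict(self, X):
        return self.model_.predict(X)


# Extrinsic part: one train/test split, owned by the evaluation workflow.
X_train, y_train = [[0.0]] * 6, [2.0, 2.1, 1.9, 2.0, 2.2, 1.8]
X_test, y_test = [[0.0], [0.0]], [2.0, 2.05]
strategy = TunedStrategy(ShrunkenMean, grid=[0.0, 1.0, 10.0]).fit(X_train, y_train)
test_error = mean_squared_error(strategy.predict(X_test), y_test)
```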
2.2 Notational and mathematical conventions
To avoid confusion between quantities which are random and non-random, we always explicitly say if a quantity is a random variable. Furthermore, instead of declaring the type of a random variable, say $X$, by writing it out as a measurable function $X: \Omega \to \mathcal{X}$, we say “$X$ is a random variable taking values in $\mathcal{X}$”, or abbreviated “$X$ t.v.in $\mathcal{X}$”, suppressing mention of the probability space $\Omega$, which we assume to be the same for all random variables appearing.
This allows us easily to talk about random variables taking values in certain sets of functions, for example a prediction functional obtained from fitting to a training set. Formally, we will denote the set of functions from a set $\mathcal{X}$ to a set $\mathcal{Y}$ by the type theoretic arrow symbol $\mathcal{X} \to \mathcal{Y}$, where bracketing as in $[\mathcal{X} \to \mathcal{Y}]$ may be added for clarity and disambiguation. E.g., to clarify that we consider a function valued random variable $f$, we will say for example “let $f$ be a random variable t.v.in $[\mathcal{X} \to \mathcal{Y}]$”.
An observant reader familiar with measure theory will notice a potential issue (others may want to skip to the next sub-section): the set $[\mathcal{X} \to \mathcal{Y}]$ is, in general, not endowed with a canonical measure. This is remedied as follows: if we talk about a random variable $f$ taking values in $[\mathcal{X} \to \mathcal{Y}]$, it is assumed that the image of the corresponding measurable function $f: \Omega \to [\mathcal{X} \to \mathcal{Y}]$, which may not be all of $[\mathcal{X} \to \mathcal{Y}]$, is a measurable space. This is, for example, the case when we substitute training data random variables into a deterministic training functional, which canonically endows the image of that functional with the substitution push-forward measure.
2.3 Setting: supervised learning on multiple datasets
We introduce mathematical notation to describe datasets and prediction strategies. As running indices, we will consistently use $i$ for the dataset, $j$ for the (training or test) data point in a given data set, and $k$ for the estimator.
The data in the $i$-th dataset are assumed to be sampled from mutually independent, generative/population random variables $(X^{(i)}, Y^{(i)})$, taking values in feature-label-pairs $\mathcal{X}_i \times \mathcal{Y}$, where either $\mathcal{Y} = \mathbb{R}$ (regression) or $\mathcal{Y}$ is finite (classification). In particular we assume that the label type is the same in all datasets.
The actual data are i.i.d. samples from the population $(X^{(i)}, Y^{(i)})$, which for notational convenience we assume to be split into a training set $\mathcal{D}_i = ((X^{(i)}_1, Y^{(i)}_1), \dots, (X^{(i)}_{N_i}, Y^{(i)}_{N_i}))$ and a test set $\mathcal{T}_i = ((X^{*(i)}_1, Y^{*(i)}_1), \dots, (X^{*(i)}_{M_i}, Y^{*(i)}_{M_i}))$. Note that the training and test set in the $i$-th dataset are, formally, not “sets” (as in common diction) but ordered tuples of length $N_i$ and $M_i$. This is for notational convenience which allows easy reference to single data points. By further convention, we will write $Y^{*(i)} := (Y^{*(i)}_1, \dots, Y^{*(i)}_{M_i})$ for the ordered tuple of test labels.
On each of the datasets, different prediction strategies are fitted to the training set: these are formalized as random prediction functionals $f_{i,k}$ t.v.in $[\mathcal{X}_i \to \mathcal{Y}]$, where $i = 1, \dots, D$ and $k = 1, \dots, K$. We interpret $f_{i,k}$ as the fitted prediction functional obtained from applying the $k$-th prediction strategy on the $i$-th dataset where it is fitted to the training set.
Statistically, we make mathematical assumptions to mirror the reasonable intuitive assumptions that there is no active information exchange between different strategies, or between copies of a given strategy applied to different data sets: we assume that the random variable $f_{i,k}$ may depend on the training set $\mathcal{D}_i$, but is independent of all other data, i.e., the test set $\mathcal{T}_i$ of the $i$-th dataset, and training and test sets of all the other datasets. It is further assumed that $f_{i,k}$ is independent of all other fitted functionals $f_{i',k'}$ where $i' \neq i$ and $k'$ is entirely arbitrary. It is also assumed that $f_{i,k}$ is conditionally independent of all $f_{i,k'}$ where $k' \neq k$, given $\mathcal{D}_i$.
We further introduce notation for predictions $\hat{Y}^{(i)}_{j,k} := f_{i,k}(X^{*(i)}_j)$, i.e., $\hat{Y}^{(i)}_{j,k}$ is the prediction made by the fitted prediction functional $f_{i,k}$ for the actually observed test label $Y^{*(i)}_j$.
For convenience, the same notation is introduced for the generative random variables, i.e., $\hat{Y}^{(i)}_k := f_{i,k}(X^{(i)})$. Similarly, we denote by $\hat{Y}^{*(i)}_k := (\hat{Y}^{(i)}_{1,k}, \dots, \hat{Y}^{(i)}_{M_i,k})$ the random vectors of length $M_i$ whose entries are predictions for the full test sample, made by method $k$.
2.4 Performance - which performance?
Benchmarking experiments produce performance and comparison quantifiers for the competitor methods. It is important to recognise that these quantifiers are computed to create guarantees for the methods’ use on putative future data. These guarantees are obtained based on mathematical theorems such as the central limit theorem, applicable under empirically justified assumptions. It is crucial to note that mathematical theorems allow establishing performance guarantees on future data, despite the future data not being available to the experimenter at all. It is also important to note that the future data for which the guarantees are created are different from, and in general not identical to, the test data.
Contrary to occasional belief, performance on the test data in isolation is empirically not useful: without a guarantee it is unrelated to the argument of algorithmic effectivity the experimenter wishes to make.
While a full argument usually does involve computing performance on a statistically independent test set, the argumentative reason for this best practice is more subtle than test performance being of interest by itself. It is a consequence of “prediction” performance on the training data not being a fair proxy for performance on future data. Instead, “prediction” on an unseen (statistically independent) test set is a fair(er) proxy, as it allows for formation of performance guarantees on future data: the test set being unseen allows leveraging central limit theorems for this purpose.
In benchmark evaluation, it is hence crucial to make precise the relation between the testing setting and the application case on future data - there are two key types of distinctions on the future data application case:
(i) whether in the scenario, a fitted prediction function is to be re-used, or whether it is re-fitted on new data (potentially from a new data source).
(ii) whether in the scenario, the data source is identical with the source of one of the observed datasets, or whether the source is merely a source from the same population as the data sources observed.
Being precise about these distinctions is, in fact, practically crucial: similar to the best practice of not testing on the training set, one needs to be careful about whether a data source, or a fitted strategy that will occur in the future test case has already been observed in the benchmarking experiment, or not.
We make the above mathematically precise (a reader interested only in an informal explanation may first like to skip forward to the subsequent paragraph).
To formalize “re-use”, distinction (i) translates to conditioning on the fitted prediction functionals $f_{i,k}$, or not. Conditioning corresponds to prior observation, hence having observed the outcome of the fitting process, therefore “re-using” $f_{i,k}$. Not doing so corresponds to sampling again from the random variable, hence “re-fitting”.
To formalize the “data source” distinction, we will assume an i.i.d. process $P_1, P_2, \dots$ (taking values in joint distributions over $\mathcal{X} \times \mathcal{Y}$, with $\mathcal{X}$ also selected at random), generating distributions according to which population laws are distributed, i.e., $P_1, \dots, P_D$ is an i.i.d. sample. The $i$-th element of this sample, $P_i$, is the (generating) data source for the $i$-th data set, i.e., $(X^{(i)}, Y^{(i)}) \sim P_i$. We stress that $P_i$ takes values in distributions, i.e., $P_i$ is a distribution which is itself random, and from which data are generated. (Footnote 2: Thus, the symbol $\sim$ is used here in its common “distribution” and not “distribution of random variable” meaning, which are usually confounded by abuse of notation.) In this mathematical setting, the distinction (ii) then states whether the guarantee applies for data sampled from $P_i$ with a specific $i$, or instead data sampled from a further, independent element $P$ of the same process. The former is “data from the already observed $i$-th source”, the latter is “data from a source similar to, but not identical to, the observed source”. If the latter is the case, the same generative principle is applied to yield a prediction functional $f_k$, drawn i.i.d. from a hypothetical generating process which yielded the $f_{i,k}$ on the $i$-th dataset. We remain notationally consistent by writing $(X, Y) \sim P$ for the population random variables of the new source.
For intuitive clarity, let us consider an example: three supervised classification methods, a random forest, logistic regression, and the baseline “predicting the majority class”, are benchmarked on 50 datasets, from 50 hospitals, one dataset corresponding to observations in exactly one hospital. Every dataset is a sample of patients (data frame rows) for which as variables (data frame columns) the outcome (= prediction target and data frame column) therapy success yes/no for a certain disease is recorded, plus a variety of demographic and clinical variables (data frame columns) - where what is recorded differs by hospital.
A benchmarking experiment may be asked to produce a performance quantifier for one of the following three distinct key future data scenarios:
(a) re-using the trained classifiers (e.g., random forest), trained on the training data of hospital 42, to make predictions on future data observed in hospital 42.
(b) (re-)fitting a given classifier (e.g., random forest) to new data from hospital 42, to make predictions on further future data observed in hospital 42.
(c) obtaining future data from a new hospital 51, fitting the classifiers to that data, and using the so fitted classifiers to make predictions on further future data observed from hospital 51.
It is crucial to note that both performances and guarantees may (and in general will) differ between these three scenarios. In hospital 42, a random forest may outperform logistic regression and the baseline, while in hospital 43 nothing outperforms the baseline. The behaviour and ranking of strategies may also be different, depending on whether classifiers are re-used, or re-fitted. This may happen in the same hospital, or when done in an average unseen hospital. Furthermore, the same qualitative differences as for observed performances may hold for the precision of the statistical guarantees obtained from performances in a benchmarking experiment: the sample size of patients in a given hospital may be large enough or too small to observe a significant difference of performances in a given hospital, while the sample size of hospitals is the key determinant of how reliable statistical guarantees about performances and performance differences for unseen hospitals are.
In what follows, we introduce abbreviating terminology for the distinctions above: for (i), we will talk about a re-used (after training once) and a re-trained (on new data) prediction algorithm. For (ii), we will talk about seen and unseen data sources. Further, we will refer to the three future data scenarios by the letters (a), (b), and (c). In this terminology, the algorithm is: (a) re-used on seen sources, (b) re-trained on seen sources, and (c) re-trained on an unseen source (similar to but not identical to seen sources).
It should be noted that it is impossible to re-use an algorithm on an unseen source, by definition of the word “unseen”, hence the hypothetical fourth combination of the two dichotomies re-used/re-trained and unseen/seen is logically impossible.
2.5 Performance quantification
Performance of the prediction strategy is measured by a variety of quantifiers which compare predictions for the test set with actual observations from the test set, the “ground truth”. Three types of quantifiers are common:
(i) Average loss based performance quantifiers, obtained from a comparison of one method’s predictions and ground truth observations one-by-one. An example is the mean squared error on the test set, which is the average squared loss.
(ii) Aggregate performance quantifiers, obtained from a comparison of all of a given method’s predictions with all of the ground truth observations. Examples are sensitivity or specificity.
(iii) Ranking based performance quantifiers, obtained from relative performance ranks of multiple methods, from a ranked comparison against each other. These are usually leveraged for comparative hypothesis tests, and may or may not involve computation of ranks based on average or aggregate performances as in (i) and (ii). An example is the Friedman rank test to compare multiple strategies.
The three kinds of performance quantifiers are discussed in more detail below.
2.5.1 Average based performance quantification
For this, the most widely used method is a loss (or score) function $L: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, which compares a single prediction (by convention the first argument) with a single observation (by convention the second argument).
Common examples for such loss/quantifier functions are listed below in Table 1.
Table 1: List of some popular loss functions to measure prediction goodness (2nd column) used in the most frequent supervised prediction scenarios (1st column). Above, $\hat{y}$ and $y$ are elements of $\mathcal{Y}$. For classification, $\mathcal{Y}$ is discrete; for regression, $\mathcal{Y} = \mathbb{R}$. The symbol $\mathbb{1}[A]$ evaluates to $1$ if the boolean expression $A$ is true, otherwise to $0$.
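For illustration, a few standard pointwise losses of this kind, together with the average loss on a test set, can be written as follows. The selection (misclassification, squared and absolute loss) and the function names are our own choice for this sketch and are not a transcription of Table 1; each function follows the convention that the prediction is the first argument and the observation the second.

```python
def zero_one_loss(y_pred, y_true):
    """Misclassification loss: the indicator of y_pred != y_true."""
    return 1.0 if y_pred != y_true else 0.0


def squared_loss(y_pred, y_true):
    return (y_pred - y_true) ** 2


def absolute_loss(y_pred, y_true):
    return abs(y_pred - y_true)


def mean_loss(loss, predictions, observations):
    """Average loss over a test set: the empirical counterpart of an expected loss."""
    if len(predictions) != len(observations) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(loss(p, y) for p, y in zip(predictions, observations)) / len(predictions)


misclassification_rate = mean_loss(zero_one_loss, ["a", "b", "a"], ["a", "a", "a"])
mean_squared_error = mean_loss(squared_loss, [1.0, 3.0], [2.0, 1.0])
mean_absolute_error = mean_loss(absolute_loss, [1.0, 3.0], [2.0, 1.0])
```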
In direct alignment with the different future data scenarios discussed in Section 2.4, the distributions of three generative random variables are of interest:
(a) The conditional random variable $L(f_{i,k}(X^{(i)}), Y^{(i)}) \mid f_{i,k}$, the loss when predicting on future data from the $i$-th data source, when re-using the already trained prediction functional $f_{i,k}$. Note that formally, through conditioning, $f_{i,k}$ is implicitly considered constant (not random), which therefore reflects re-use of an already trained functional.
(b) The random variable $L(f_{i,k}(X^{(i)}), Y^{(i)})$, the loss when re-training method $k$ on training data from the $i$-th data source, and predicting labels on future data from the $i$-th data source. Without conditioning, no re-use occurs, and this random variable reflects repeating the whole random experiment including re-training of $f_{i,k}$.
(c) The random variable $L(f_k(X), Y)$, the loss when training method $k$ on a completely new data source, and predicting labels on future data from the same source as that dataset. Here $(X, Y)$ and $f_k$ denote the population variable and the fitted functional for the new source.
The distributions of the above random variables are generative, hence unknown. In practice, the validation workflow estimates summary statistics of these. Of particular interest in the mlaut workflow are related expectations, i.e., (arithmetic) population average errors. We list them below, suppressing notational dependency on $L$ for ease of notation:
$\eta_{i,k} := \mathbb{E}[L(f_{i,k}(X^{(i)}), Y^{(i)}) \mid f_{i,k}]$, the (training set) conditional expected generalization error of (a re-used) $f_{i,k}$, on data source $i$.
$\eta_k := \frac{1}{D} \sum_{i=1}^{D} \eta_{i,k}$, the conditional expected generalization error of the (re-used) $k$-th strategy, averaged over all seen data sources.
$\varepsilon_{i,k} := \mathbb{E}[L(f_{i,k}(X^{(i)}), Y^{(i)})]$, the unconditional expected generalization error of (a re-trained) $f_{i,k}$, on data source $i$.
$\varepsilon_k := \mathbb{E}[L(f_k(X), Y)]$, the expected generalization error on a typical (unseen) data source, where $(X, Y)$ and $f_k$ denote the population variable and the fitted functional for a new source.
It should be noted that $\eta_{i,k}$ and $\eta_k$ are random quantities, but conditionally constant once the respective $f_{i,k}$ are known (e.g., once $f_{i,k}$ has been trained). It further holds that $\mathbb{E}[\eta_{i,k}] = \varepsilon_{i,k}$, by the law of total expectation.
The mlaut toolbox currently implements estimators for only two of the above three future data situations - namely, only for situations (a: re-used, seen) and (c: re-trained, unseen), i.e., estimators for all quantities with the exception of $\varepsilon_{i,k}$. The reason for this is that for situation (b: re-trained, seen), at the current state of literature it appears unclear how to obtain good estimates, that is, with provably favourable statistical properties independent of the data distribution or the algorithmic strategy. For situations (a) and (c), classical statistical theory may be leveraged, e.g., mean estimation and frequentist hypothesis testing.
It should also be noted that $\varepsilon_{i,k}$ is a single dataset performance quantifier rather than a benchmark performance quantifier, and therefore outside the scope of mlaut’s core use case. While $\eta_{i,k}$ is also a single dataset quantifier, it is easy to estimate en passant while estimating the benchmark quantifier $\eta_k$, hence included in discussion as well as in mlaut’s functionality.
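The role of classical mean estimation in situations (a) and (c) can be sketched as follows. This is a simplified illustration under our own assumptions, not mlaut's estimator code: the function name `mean_with_ci`, the toy loss values and the normal-approximation interval with $z = 1.96$ are illustrative. The key difference between the two situations is the sampling unit: test points within a dataset for the re-used, seen-source case, and whole datasets for the re-trained, unseen-source case.

```python
import math
import statistics


def mean_with_ci(values, z=1.96):
    """Sample mean with a normal-approximation (central limit theorem) interval."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate a standard error")
    mean = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)
    return mean, (mean - z * std_err, mean + z * std_err)


# Pointwise test losses of one fitted strategy on three toy datasets.
losses_by_dataset = {
    "d1": [0.0, 1.0, 0.0, 0.0],
    "d2": [1.0, 1.0, 0.0, 1.0],
    "d3": [0.0, 0.0, 1.0, 1.0],
}

# Situation (a), re-used on a seen source: the sample consists of test points,
# so the interval narrows with the test set size of that dataset.
seen_estimates = {name: mean_with_ci(l) for name, l in losses_by_dataset.items()}

# Situation (c), re-trained on an unseen source: the sample consists of
# datasets, each contributing one mean loss, so the interval narrows only
# with the number of datasets.
dataset_means = [statistics.mean(l) for l in losses_by_dataset.values()]
unseen_estimate = mean_with_ci(dataset_means)
```

With only three datasets the normal approximation is crude; the example is meant to show the structure of the two estimates, not a recommended sample size.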
2.5.2 Aggregate based performance quantification
A somewhat less frequently used alternative are aggregate loss/score functions $L: (\mathcal{Y} \times \mathcal{Y})^M \to \mathbb{R}$, which compare a tuple of predictions with a tuple of observations in a way that is not expressible as a mean loss such as in Section 2.5.1. Here, by slight abuse of notation, $(\mathcal{Y} \times \mathcal{Y})^M$ denotes tuples of $(\mathcal{Y} \times \mathcal{Y})$-pairs, of fixed length $M$. The use of the symbol $L$ is discordant with the previous section and assumes a case distinction on whether an average or an aggregate is used.
The most common uses of aggregate performance quantifiers are found in deterministic binary classification, as entries of the classification contingency table. These, and further common examples, are listed below in Table 2.
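|task|quantifier|name|
|classification (det., binary)|$\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$|sensitivity, recall|
|regression|$\sqrt{\frac{1}{M}\sum_{j=1}^{M}(\hat{y}_j - y_j)^2}$|root mean squared error|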
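The two aggregate quantifiers named above can be sketched in Python as follows; the function names and toy data are our own. The second half of the example illustrates why such quantifiers are not mean losses: the root mean squared error of a full test set differs from the average of the root mean squared errors of its parts.

```python
import math


def sensitivity(y_pred, y_true, positive=1):
    """True positives divided by all positive observations (recall)."""
    true_pos = sum(1 for p, t in zip(y_pred, y_true) if t == positive and p == positive)
    false_neg = sum(1 for p, t in zip(y_pred, y_true) if t == positive and p != positive)
    if true_pos + false_neg == 0:
        raise ValueError("no positive observations: sensitivity is undefined")
    return true_pos / (true_pos + false_neg)


def rmse(y_pred, y_true):
    """Root mean squared error over the whole tuple of predictions."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true))


recall = sensitivity([1, 0, 0, 1], [1, 1, 0, 1])

# RMSE takes the full tuple as input: it cannot be recovered by averaging
# the RMSE values of sub-samples.
y_true = [0.0, 0.0, 0.0, 0.0]
y_pred = [0.0, 0.0, 3.0, 4.0]
rmse_full = rmse(y_pred, y_true)
rmse_halves = (rmse(y_pred[:2], y_true[:2]) + rmse(y_pred[2:], y_true[2:])) / 2
```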
As before, for the different future data scenarios in Section 2.4, the distributions of three types of generative random variables are of interest. The main complication is that aggregate performance metrics take multiple test points and predictions as input, hence to specify a population performance one must specify a test set size. In what follows, we will fix a specific test set size, $M_i$, for the $i$-th dataset. Recall the notation $Y^{*(i)}$ for the full vector of test labels on data set $i$. In analogy, we denote by $\hat{Y}^{*(i)}_k$ random vectors of length $M_i$ whose entries are predictions for the full test sample, made by method $k$, i.e., having as the $j$-th entry the prediction $\hat{Y}^{(i)}_{j,k}$, as introduced in Section 2.3. Similarly, we denote by $Y^*$ and $\hat{Y}^*_k$ vectors of observations and predictions whose entries are i.i.d. from the data generating distribution of the new data source, both of length $M$, where $M$ by assumption follows the sampling distribution of the $M_i$.
The population performance quantities of interest can be formulated in terms of the above:
$\eta_{i,k} := \mathbb{E}[L(\hat{Y}^{*(i)}_k, Y^{*(i)}) \mid f_{i,k}]$, the (training set) conditional expected generalization error of (a re-used) $f_{i,k}$, on data source $i$.
$\eta_k := \frac{1}{D} \sum_{i=1}^{D} \eta_{i,k}$, the conditional expected generalization error of the (re-used) $k$-th strategy, averaged over all seen data sources.
$\varepsilon_{i,k} := \mathbb{E}[L(\hat{Y}^{*(i)}_k, Y^{*(i)})]$, the unconditional expected generalization error of (a re-trained) $f_{i,k}$, on data source $i$.
$\varepsilon_k := \mathbb{E}[L(\hat{Y}^*_k, Y^*)]$, the expected generalization error on a typical (unseen) data source.
As before, the future data situations are (a: re-used algorithm, seen sources), (b: re-trained, seen), and (c: re-trained, unseen). In the general setting, the expectations in (a) and (b) may or may not converge to sensible values as $M_i$ approaches infinity, depending on properties of $L$. General methods of estimating these depend on availability of test data, which due to the complexities arising and the currently limited state-of-art are outside the scope of mlaut. This unfortunately leaves the benchmarking quantity $\eta_k$ outside the scope for aggregate performance quantifiers. For (c), classical estimation theory of the mean applies.
2.5.3 Ranking based performance quantification
Ranking based approaches consider, on each dataset, a performance ranking of the competitor strategies with respect to a chosen raw performance statistic, e.g., an average or an aggregate performance such as RMSE or F1-score. Performance assessment is then based on the rankings - in the case of ranking, this is most often a comparison, usually in the form of a frequentist hypothesis test. Due to the dependence of the ranking on a raw performance statistic, it should always be understood that ranking based comparisons are with respect to the chosen raw performance statistic, and may yield different results for different raw performance statistics.
Mathematically, we introduce the population performances in question. Denote in the case the raw statistic being an average, and denote in case it is an aggregate (on the RHS using notation of the respective previous Sections 2.5.1 and 2.5.2). The distribution of models generalization performance of the -th strategy on the -th dataset.
We further define rankings as the order rank of within the tuple , i.e., the ranking of the performance within all strategies’ performances on the -th dataset.
Of common interest in performance quantification and benchmark comparison are the average ranks, i.e., ranks of a strategy averaged over datasets. The population quantity of interest is the expected average rank on a typical dataset, i.e., where is the population variable corresponding to sample variables . It should be noted that the average rank depends not only on what the -th strategy is or does, but also on the presence of the other strategies in the benchmarking study - hence it is not an absolute performance quantifier for a single method, but a relative quantifier, to be seen in the context of the competitor field.
Common benchmarking methodology of the ranking kind quantifies relative performance on the data sets observed in the sense of future data scenario (b) or (c), where the performance is considered including (re-)fitting of the strategies.
3 Benchmarking supervised learning strategies on multiple datasets - methods
We now describe the suite of performance and comparison quantification methods implemented in the mlaut package. It consists largely of state-of-the-art model comparison strategies for the multiple datasets situation, supplemented by our own constructions based on standard statistical estimation theory where appropriate. References and prior work will be discussed in the respective sub-sections. mlaut supports the following types of benchmark quantification methodology and post-hoc analyses:
loss-based performance quantifiers, such as mean squared error and mean absolute error, including confidence intervals.
aggregate performance quantifiers, such as contingency table quantities (sensitivity, specificity) in classification, including confidence intervals.
rank based performance quantifiers, such as average performance rank.
comparative hypothesis tests, for relative performance of methods against each other.
The exposition uses notation and terminology previously introduced in Section 2. Different kinds of quantifiers (loss and/or rank based), and different kinds of future performance guarantees (trained vs re-fitted prediction functional; seen vs unseen sources), as discussed in Section 2.4, may apply across all types of benchmarking analyses.
Which of these is the case, especially under which future data scenario the guarantee given is supposed to hold, will be said explicitly for each, and should be taken into account by any use of the respective quantities in scientific argumentation.
Practically, our recommendation is to consider which of the future data scenarios (a), (b), (c) a guarantee is sought for, and whether evidencing differences in rank, or differences in absolute performances, are of interest.
3.1 Average based performance quantifiers and confidence intervals
For average based performance quantifiers, performances and their confidence intervals are estimated from the sample of loss/score evaluates. We will denote the elements in this sample by (for notation on RHS see Section 2.5.1 ). Note that, differently from the population quantities, there are three (not two) indices: for the strategy, for the dataset, and for which test set point we are considering.
|estimate||estimates||f.d.s.||standard error estimate||CLT in|
Table 3 presents a number of expected loss estimates with proposed standard error estimates. As all estimates are mean estimates of independent (or conditionally independent) quantities, normal approximated, two-sided confidence intervals may be obtained for any of the quantities in the standard way, i.e., at confidence as the interval
where is the respective (mean) estimate and is the corresponding standard error estimate.
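As a minimal sketch of the normal-approximated interval described above, the following Python snippet computes a mean loss estimate, its standard error, and a two-sided confidence interval from a sample of loss evaluates. The function name `normal_confidence_interval` and the example numbers are assumptions made for illustration; this is not mlaut's own code.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def normal_confidence_interval(losses, confidence=0.95):
    """Return (estimate, standard_error, lower, upper) for a mean loss."""
    n = len(losses)
    estimate = mean(losses)
    # Standard error of the mean, using the sample standard deviation.
    standard_error = stdev(losses) / sqrt(n)
    # Two-sided standard normal quantile for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return (estimate, standard_error,
            estimate - z * standard_error, estimate + z * standard_error)


# Hypothetical squared errors of one strategy on one test set.
squared_errors = [0.10, 0.40, 0.25, 0.05, 0.30, 0.20, 0.15, 0.35]
est, se, lo, hi = normal_confidence_interval(squared_errors)
print(f"mean loss {est:.3f}, 95% interval [{lo:.3f}, {hi:.3f}]")
```

The interval is symmetric around the estimate by construction. Only the standard error changes between future data scenarios, which is why intervals can differ even when midpoints coincide.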
Note that different estimates and confidence intervals arise through the different future data scenarios that the guarantee is meant to cover - see Sections 2.5.1 and 2.4 for a detailed explanation how precisely the future data scenarios differ in terms of re-fitting/re-using the prediction functional, and obtaining performance guarantees for predictive use on an unseen/seen data source. In particular, choosing a different future data scenario may affect the confidence intervals even though the midpoint estimate is the same: the midpoint estimates and coincide, but the confidence intervals for future data scenario (c), i.e., new data source and the strategy is re-fitted, are usually wider than the confidence intervals for the future data scenario (a), i.e., already seen data source and no re-fitting of the strategy.
Technically, all expected loss estimates proposed in Table 3 are (conditional) mean estimates. The confidence intervals for and are obtained as standard confidence intervals for a (conditionally) independent sample mean: is considered to be the mean of the independent samples (varying over ). is considered to be the mean of the conditionally independent samples (varying over , and conditioned on ). Confidence intervals for are obtained averaging the estimated variances of independent summands , which corresponds to the plug-in estimate obtained from the equality (all variances conditional on the ).
3.2 Aggregate based performance quantifiers and confidence intervals
For aggregate based performance quantifiers, performances and their confidence intervals are estimated from the sample of loss/score evaluates. We will denote the elements in this sample by (for notation on RHS see Section 2.5.2). We note that unlike in the case of average based evaluation, there is no running index for the test set data point, only indices for the data set and for the prediction strategy.
|estimate||estimates||f.d.s.||standard error estimate||CLT in|
Table 4 presents one estimate of expected loss estimates with proposed standard error estimate, for future data situation (c), i.e., generalization of performance to a new dataset. Even though there is only a single estimate, we present it in a table for concordance with Table 3. An confidence interval at confidence is obtained as
The mean and variance estimates are obtained from standard theory of mean estimation, by the same principle as for average based estimates. Estimates for situations (a) may be naively constructed from multiple test sets of the same size, or obtained from further assumptions on via re-sampling, though we abstain from developing such an estimate as it does not seem to be common - or available - at the state-of-art.
3.3 Rank based performance quantifiers
mlaut has functionality to compute rankings based on any average or aggregate performance statistic, denoted below. I.e., for any choice of , the following may be computed.
As in Section 2.5.3, define in the case the raw statistic being an average, and in case it is an aggregate. Denote by the order rank of within the tuple .
|estimate||estimates||f.d.s.||standard error estimate||CLT in|
Table 5 presents an average rank estimate and an average rank difference estimate, for future data situation (c), i.e., generalization of performance to a new dataset.
The average rank estimate and its standard error are based on the central limit theorem in the number of data sets. The average rank difference estimate is Neményi’s critical difference as referred to in  which is used in visualizations.
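The rank computation can be illustrated with a short Python sketch: performances are ranked within each dataset, ranks are averaged over datasets, and the standard error is the sample standard deviation of the ranks divided by the square root of the number of datasets. The helper names, the convention that tied performances share the average rank, and the example numbers are assumptions for illustration rather than mlaut's implementation.

```python
from math import sqrt
from statistics import mean, stdev


def rank_row(performances, lower_is_better=True):
    """Order ranks (1 = best) within one dataset; ties share the average rank."""
    keyed = [p if lower_is_better else -p for p in performances]
    order = sorted(range(len(keyed)), key=lambda j: keyed[j])
    ranks = [0.0] * len(keyed)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and keyed[order[end + 1]] == keyed[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0  # average of positions pos+1 .. end+1
        for idx in order[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks


def average_ranks(performance_table):
    """performance_table[i][k]: performance of strategy k on dataset i."""
    rank_table = [rank_row(row) for row in performance_table]
    n_datasets = len(rank_table)
    columns = list(zip(*rank_table))
    avg = [mean(col) for col in columns]
    se = [stdev(col) / sqrt(n_datasets) for col in columns]
    return avg, se


# Hypothetical RMSE values: 4 datasets (rows) x 3 strategies (columns).
rmse = [
    [0.30, 0.25, 0.40],
    [0.50, 0.45, 0.45],
    [0.20, 0.22, 0.35],
    [0.60, 0.55, 0.70],
]
avg_rank, rank_se = average_ranks(rmse)
print(avg_rank, rank_se)
```

Because ranks on each dataset sum to the same constant, the average ranks of all strategies also sum to that constant, which reflects the relative nature of this quantifier.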
3.4 Statistical tests for method comparison
While the methods in previous sections compute performances with confidence bands, they do not by themselves allow methods to be compared in the sense of ruling out that differences are due to randomness (with the usual statistical caveat that this can never be ruled out entirely, but the plausibility can be quantified).
mlaut implements significance tests for two classes of comparisons: absolute performance differences, and average rank differences, in future data scenario (c), i.e., with a guarantee for the case where the strategy is re-fitted to a new data source.
mlaut’s selection follows closely, and our exposition below follows loosely, the work of . While the latter is mainly concerned with classifier comparison, there is no restriction-in-principle to leverage the same testing procedures for quantitative comparison with respect to arbitrary (average or aggregate) raw performance quantifiers.
3.4.1 Performance difference quantification
The first class of tests we consider quantifies, for a choice of aggregate or average loss , the significance of average differences of expected generalization performances, between two strategies and . The meanings of “average” and “significant” may differ, and so does the corresponding effect size - these are made precise below.
All the tests we describe are based on the paired differences of performances, where the pairing considered is the pairing through datasets. That is, on dataset , there are performances of strategy and which are considered as a pair of performances.
For the paired differences, we introduce abbreviating notation if the performance is an average loss/score, and if the loss is an aggregate loss/score. Non-parametric tests below will also consider the ranks of the paired differences, we will write for the rank of within the sample , i.e., taking values between and .
We denote by and the respective population versions, i.e., the performance difference on a random future dataset, as in scenario (c).
Table of pairwise comparison tests for benchmark comparison. name = name of the testing procedure. tests null = the null hypothesis that is tested by the testing procedure. e.s.(raw) = the corresponding effect size, in raw units. e.s.(norm) = the corresponding effect size, normalized. stat. = the test statistic which is used in computation of significance. Symbols are defined as in the previous sections.
Table 6 lists a number of common testing procedures. The significances may be seen as guarantees for future data situation (c). The normalized effect size for the paired t-test comparing the performance of strategies and , the quantity in Table 6, is called Cohen’s d(-statistic) for paired samples (to avoid confusion in comparison with literature, it should be noted that Cohen’s d-statistic also exists for unpaired versions of the t-test which we do not consider here in the context of performance comparison). The normalized effect size for the Wilcoxon signed-rank test, the quantity , is called biserial rank correlation, or rank-biserial correlation.
It should also be noted that the Wilcoxon signed-rank test, while making use of rank differences, is not a pairwise comparison of strategies’ performance ranks - this is a common misunderstanding. While “ranks” appear in both concepts, the ranks in the Wilcoxon signed-rank test are the ranks of the performance differences, pooled across data sets, while in a rank based performance quantifier, the ranking of different methods’ performances (not differences) within a data set (not across data sets) is considered.
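The following Python sketch illustrates the two paired procedures on per-dataset performances of two strategies. It uses the textbook definitions: Cohen's d for paired samples as the mean difference divided by the standard deviation of the differences, and the matched-pairs rank-biserial correlation as the normalized difference of the signed-rank sums. Function names and example values are assumptions for illustration, and the exact normalizations used in mlaut may differ.

```python
from math import sqrt
from statistics import mean, stdev


def paired_differences(perf_a, perf_b):
    """Per-dataset differences; position i in both lists is the same dataset."""
    return [a - b for a, b in zip(perf_a, perf_b)]


def paired_t_and_cohens_d(diffs):
    """Paired t-statistic (n - 1 degrees of freedom) and Cohen's d."""
    d_mean, d_sd = mean(diffs), stdev(diffs)
    cohens_d = d_mean / d_sd
    t_stat = d_mean / (d_sd / sqrt(len(diffs)))
    return t_stat, cohens_d


def signed_rank_sums(diffs):
    """Wilcoxon signed-rank sums (W+, W-) and rank-biserial correlation."""
    nonzero = [d for d in diffs if d != 0]  # zero differences are dropped
    # Ranks are taken over the absolute differences, pooled across datasets.
    order = sorted(range(len(nonzero)), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * len(nonzero)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and abs(nonzero[order[end + 1]]) == abs(nonzero[order[pos]])):
            end += 1
        for i in order[pos:end + 1]:
            ranks[i] = (pos + end) / 2.0 + 1.0  # tied values share a rank
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, nonzero) if d < 0)
    return w_plus, w_minus, (w_plus - w_minus) / (w_plus + w_minus)


# Hypothetical losses of two strategies on five datasets.
loss_a = [0.30, 0.45, 0.20, 0.60, 0.52]
loss_b = [0.25, 0.47, 0.35, 0.50, 0.40]
diffs = paired_differences(loss_a, loss_b)
t_stat, cohens_d = paired_t_and_cohens_d(diffs)
w_plus, w_minus, rank_biserial = signed_rank_sums(diffs)
```

Note that the ranks here are ranks of differences across datasets, not ranks of strategies within a dataset.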
Portmanteau tests for the above may be based on parametric ANOVA, though  recommends avoiding these due to the empirical asymmetry and non-normality of loss distributions. Hence for multiple comparisons, mlaut implements Bonferroni and Bonferroni-Holm significance correction based post-hoc testing.
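For reference, the two significance corrections can be sketched in a few lines of Python. The functions below return adjusted p-values that can be compared directly against the chosen significance level; the function names and example p-values are illustrative assumptions and do not reproduce mlaut's interface.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def bonferroni_holm(p_values):
    """Holm's step-down adjustment of a list of p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for step, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, etc.
        candidate = min(1.0, (m - step) * p_values[idx])
        # Adjusted p-values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values from four pairwise comparisons.
raw_p = [0.01, 0.04, 0.03, 0.20]
print(bonferroni(raw_p))
print(bonferroni_holm(raw_p))
```

Holm's procedure never yields larger adjusted p-values than plain Bonferroni, so it is at least as powerful while controlling the same family-wise error rate.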
In order to compare the performance of the prediction functions one needs to perform statistical tests on the output produced by . Below we enumerate the statistical tests that can be employed to assess the results produced by the loss functions as described in 2.5.1.
3.4.2 Performance rank difference quantification
Performance rank based testing uses the observed performance ranks of the -th strategy, on the -th data set. These are defined as above in Section 3.3, of which we keep notation, including notation for the average rank estimate . We further introduce abbreviating notation for rank differences, .
|test||(for some )|
Table 7 describes common testing procedures which may both be seen as tests for a guarantee of expected rank difference in future data scenario (c). The sign test is a binomial test regarding the proportion being significantly different from . In case of ties, a trinomial test is used. The implemented version of the Friedman test uses the F-statistic (and not the Q-statistic aka chi-squared-statistic) as described in .
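A minimal Python sketch of both procedures follows. The sign test is written as a two-sided binomial test on win/loss counts with ties discarded beforehand (a simplification; the trinomial variant for ties is not shown). The Friedman statistics use the standard formulas without tie correction, with the F-statistic derived from the chi-squared statistic. Function names and inputs are assumptions for illustration.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Two-sided binomial test of H0: P(win) = 1/2, ties already discarded."""
    n = wins + losses
    tail = sum(comb(n, i) for i in range(min(wins, losses) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def friedman_statistics(rank_table):
    """rank_table[i][k]: rank of strategy k on dataset i. Returns (chi2, F)."""
    n, k = len(rank_table), len(rank_table[0])
    avg_ranks = [sum(col) / n for col in zip(*rank_table)]
    chi2 = (12.0 * n / (k * (k + 1))
            * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0))
    # F-statistic variant, compared to an F-distribution with (k - 1) and
    # (k - 1)(n - 1) degrees of freedom. It is undefined when chi2 equals
    # n * (k - 1), i.e., when all datasets produce the same ranking.
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, f_stat


# Hypothetical example: strategy A beats strategy B on 8 of 10 datasets.
p_sign = sign_test_p_value(wins=8, losses=2)

# Hypothetical ranks of 3 strategies on 4 datasets.
ranks = [[1, 2, 3], [2, 1, 3], [1, 2, 3], [2, 1, 3]]
chi2, f_stat = friedman_statistics(ranks)
```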
For post-hoc comparison and visualization of average rank differences, mlaut implements the combination of Bonferroni and studentized range multiple testing correction with Neményi’s confidence intervals, as described in Section 3.3.
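The critical difference used in such visualizations can be sketched as follows, using the usual formula in which a critical value of the studentized range statistic (already divided by the square root of two) is scaled by a factor depending on the number of strategies and datasets. The critical value is passed in by the caller; the value 2.343 below is the commonly tabulated one for three strategies at the 0.05 level and is stated here as an assumption to be checked against a table. The function names are hypothetical.

```python
from math import sqrt


def nemenyi_critical_difference(q_alpha, n_strategies, n_datasets):
    """Critical difference of average ranks for the Nemenyi post-hoc test."""
    k = n_strategies
    return q_alpha * sqrt(k * (k + 1) / (6.0 * n_datasets))


def significantly_different(avg_rank_a, avg_rank_b, critical_difference):
    """Two strategies differ if their average ranks are further apart than CD."""
    return abs(avg_rank_a - avg_rank_b) > critical_difference


# Assumed critical value for 3 strategies at the 0.05 level; 10 datasets.
cd = nemenyi_critical_difference(q_alpha=2.343, n_strategies=3, n_datasets=10)
print(f"critical difference: {cd:.3f}")
print(significantly_different(1.2, 2.6, cd))
```

The critical difference shrinks with the square root of the number of datasets, so small benchmark collections can only separate strategies with large average rank gaps.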
4 MLaut, API Design and Main Features
MLaut  is a modelling and workflow toolbox that was written with the aim of simplifying the task of running machine learning benchmarking experiments. MLaut was created with the specific use-case of large-scale performance evaluation on a large number of real life datasets, such as the study of . Another key goal was to provide a scalable and unified high-level interface to the most important machine learning toolboxes, in particular to include deep learning models in such a large-scale comparison.
Below, we describe package design and functionality. A short usage handbook is included in Section 4.5.
MLaut may be obtained from pyPI via pip install mlaut, and is maintained on GitHub at github.com/alan-turing-institute/mlaut. A Docker container can also be obtained from Docker Hub via docker pull kazakovv/mlaut.
4.1 Applications and Use
MLaut’s main use case is the set-up and execution of supervised (classification and regression) benchmarking experiments. The package currently provides a high-level workflow interface to scikit-learn and keras models, but can easily be extended by the user to incorporate model interfaces from additional toolboxes into the benchmarking workflow.
MLaut automatically creates an end-to-end pipeline for processing data, training machine learning models, making predictions and applying statistical quantification methodology to benchmark the performance of the different models.
More precisely, MLaut provides functionality to:
Automate the entire workflow for large-scale machine learning experiment studies. This includes structuring and transforming the data, selecting the appropriate estimators for the task and data at hand, tuning the estimators and finally comparing the results.
Fit data and make predictions by using the prediction strategies as described in 5.4 or by implementing new prediction strategies.
Evaluate the results of the prediction strategies in a uniform and statistically sound manner.
4.2 High-level Design Principles
We adhered to the high-level API design principles adopted for the scikit-learn project . These are:
Non-proliferation of classes.
We were also inspired by the Weka project , a platform widely used for its data mining functionalities. In particular, we wanted to replicate the ease of use of Weka in a pythonic setting.
4.3 Design Requirements
Specific requirements arise from the main use case of scalable benchmarking and the main design principles:
Extensibility. MLaut needs to provide a uniform and consistent interface to level 3 toolbox interfaces (as in Section 1.1). It needs to be easily extensible, e.g., by a user wanting to add a new custom strategy to benchmark.
Data collection management. Collections of data sets to benchmark on may be found on the internet or exist on a local computer. MLaut needs to provide abstract functionality for managing such data set collections.
Algorithm/model management. In order to match algorithms with data sets, MLaut needs to have abstract functionality to do so. This needs to include sensible default settings and easy meta-data inspection of standard methodology.
Orchestration management. MLaut needs to conduct the benchmarking experiment in a standardized way with minimal user input beyond its specification, with sensible defaults for the experimental set-up. The orchestration module needs to interact with, but be separate from the data and algorithm interface.
User Friendliness. The package needs to be written in a pythonic way and should not have a steep learning curve. Experiments need to be easy to set-up, conduct, and summarize, from a python console or a jupyter notebook.
In our implementation of MLaut, we attempt to address the above requirements by creating a package which:
Has a nice and intuitive scripting interface. One of our main requirements was to have a native Python scripting interface that integrates well with the rest of our code. Our design attempts to reduce user interaction to the minimally necessary interface points of experiment specification, running of experiments, and querying of results.
Provides a high level of abstraction from underlying toolboxes. Our second criterion was that MLaut provide a high level of abstraction from underlying toolboxes. One of our main requirements was for MLaut to be completely model and toolbox agnostic. The scikit-learn interface was too light-weight for our purposes as its parameter and meta-data management is not interface explicit (or inspectable).
Provides scalable workflow automation. This needed to be one of MLaut’s cornerstone contributions. Its main logic is implemented in the orchestrator class that orchestrates the evaluation of all estimators on all datasets. The class manages resources for building the estimator models, saving/loading the data and the estimator models. It is also aware of the experiment’s partial run state and can be used for easy resuming of an interrupted experiment.
Allows for easy estimator construction and retrieval. The end user of the package should be able to easily add new machine learning models to the suite of built-in ones in order to expand its functionality. Besides a small number of required methods to implement, we have provided interfaces to two of the most used level 3 toolbox packages, sklearn and keras.
Has a dedicated meta-data interface for sensible defaults of estimators. We wanted to ensure that the estimators that are packaged in MLaut come with sensible defaults, i.e. pre-defined hyper-parameters and tuning strategies that should be applicable in most use cases. The robustness of these defaults has been tested and proven as part of the original large-scale classification study. As such, the user is not required to have a detailed understanding of the algorithms and how they need to be set up, in order to make full use of them.
Provides a framework for quantitative benchmark reporting. Easily accessible evaluation methodology for the benchmarking experiments is one of the key features of the package. We also considered reproducibility of results as vital, reflected in a standardized set-up and interface for the experiments, as well as control throughout of pseudo-random seeds.
Orchestrates the experiments and parallelizes the load over all available CPU cores. A large benchmarking study can be quite computationally expensive. Therefore, we needed to make sure that all available machine resources are fully utilized in the process of training the estimators. In order to achieve this we used the parallelization methods that are available as part of the GridSearch method and natively with some of the estimators. Furthermore, we also provide a Docker container for running MLaut which we recommend using as a default as it allows the package to run in the background at full load.
Provides a uniform way of storing and retrieving data. Results of benchmarking experiments needed to be saved in a uniform way and made available to users and reviewers of the code. At the current stage, we implemented back-end functionality for management via local HDF5 database files. In the future, we hope to support further data storage back-ends with the same orchestrator-sided facade interface.
4.3.1 Estimator encapsulation
MLaut implements a logic of encapsulating the meta-data with the estimators that it pertains to. This is achieved by using a decorator class that is attached to each estimator class. By doing this, our extended interface is able to bundle wide-ranging meta-data information with each estimator class. This includes:
Basic estimator properties such as name, estimator family;
Types of tasks that a particular estimator can be applied to;
The type of data which the estimator expects or can handle;
The model architecture (on level 3, as in Section 1.1). This is particularly useful for more complex estimators such as deep neural networks. By applying the decorator structure the model architecture can be easily altered without changing the underlying estimator class.
This extended design choice has significant benefits for a benchmarking workflow package. Firstly, it allows searching for estimators based on some basic criteria such as task or estimator family. Secondly, it allows the user to inspect, query, and change default hyper-parameter settings used by the estimators. Thirdly, strategies with different internal model architectures can be deployed with relative ease.
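The idea of attaching meta-data through a class decorator and then searching by it can be sketched as below. All names (`properties`, `find_estimators`, the `meta` attribute, and the two placeholder estimator classes) are hypothetical and chosen for illustration only; they are not claimed to match MLaut's actual decorator or its fields.

```python
class properties:
    """Hypothetical class decorator attaching meta-data to an estimator class."""

    def __init__(self, name, family, tasks, data_types=("tabular",)):
        self.meta = {
            "name": name,
            "family": family,
            "tasks": tuple(tasks),
            "data_types": tuple(data_types),
        }

    def __call__(self, cls):
        # Store a copy so each decorated class owns its own meta-data.
        cls.meta = dict(self.meta)
        return cls


@properties(name="RidgeSketch", family="generalized_linear",
            tasks=["regression"])
class RidgeSketch:
    """Placeholder estimator class (no model logic in this sketch)."""


@properties(name="ForestSketch", family="ensemble_of_trees",
            tasks=["classification", "regression"])
class ForestSketch:
    """Placeholder estimator class (no model logic in this sketch)."""


def find_estimators(registry, task=None, family=None):
    """Return the classes whose meta-data matches the given criteria."""
    return [
        cls for cls in registry
        if (task is None or task in cls.meta["tasks"])
        and (family is None or cls.meta["family"] == family)
    ]


registry = [RidgeSketch, ForestSketch]
classifiers = find_estimators(registry, task="classification")
```

Keeping the meta-data on the class rather than on instances means estimators can be filtered without constructing or fitting any model.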
4.3.2 Workflow design
The workflow supported by MLaut consists of the following main steps:
Data collection. As a starting point, the user needs to gather and organize the datasets of interest on which the experiments will be run. The raw datasets need to be saved in an HDF5 database. Metadata needs to be attached to each dataset; it is used later in the training phase, for example to distinguish the target variables. MLaut provides an interface for manipulating the databases through its Data and Files_IO classes. The logic of the toolbox is to provision two HDF5 databases: one for storing the input data such as the datasets, and a second one for storing the output of the machine learning experiments and processed data such as train/test index splits. This separation of input and output is not required but is recommended. The datasets also need to be split into a train and a test set before proceeding with the next phase in the pipeline. The indices of the train and test splits are stored separately from the actual datasets in the HDF5 database to ensure data integrity and reproducibility of the experiments. All estimators are trained and tuned on the training set only. At the end of this process the estimators are used on the test sets, which guarantees that all predictions are made on unseen data.
Training phase. After the datasets are stored in the HDF5 database by following the convention adopted by MLaut, the user can proceed to training the estimators. The user needs to provide an array of machine learning estimators that will be used in the training process. MLaut provides a number of default estimators that can be instantiated. This can be done by the use of the estimators module. The package also provides the flexibility for the user to write their own estimator by inheriting from the mlaut_estimator class. Furthermore, there is a generic_estimator module which provides flexibility for the user to create new estimators with only a couple of lines of code.
The task of training the experiments is performed by the experiments.Orchestrator class. This class manages the sequence of the training and the parallelization of the load. Before training, each dataset is preprocessed according to metadata provided on the estimator level. This includes normalizing the features and target variables, and conversion from categorical to numerical values.
We recommend running the experiments inside a Docker container if they are very computationally intensive. This allows MLaut to run in the background on a server without shutting down unexpectedly due to loss of connection. We have provided a Docker image that makes this process easy.
Making predictions. During training the fitted models are stored on the hard drive. At the end of the training phase the user can again use the experiments.Orchestrator class to retrieve the trained models and make predictions on the test sets.
Analyse results. The last stage is analysing the output of the results of the machine learning experiments. In order to initiate the process the user needs to call the analyze_results.prediction_errors method, which returns two dictionaries: the average errors per estimator on all datasets, and the errors per estimator achieved on each dataset. These results can be used as inputs to the statistical tests that are also provided as part of the analyze_results module, which mostly follow the methodology proposed by .
4.4 Software Interface and Main Toolbox Modules
MLaut is built around the logic of the pipeline workflow described earlier. Our aim was to implement the programming logic for each step of the pipeline in a different module. The code that is logically used in more than one of the stages is implemented in a Shared module that is accessible by all other classes. The current design pattern is most closely represented by the façade and adaptor patterns under which the user interacts with one common interface to access the underlying adaptors which represent the underlying machine learning and statistical toolboxes.
4.4.1 Data Module
The Data module contains the high level methods for manipulating the raw datasets. It provides a second layer of interface to the lower level classes for accessing, storing and extracting data from HDF5 databases. This module makes heavy use of the functionality developed in the Shared module but provides a higher level of abstraction for the user.
4.4.2 Estimators Module
This module encompasses all machine learning models that come with MLaut as well as methods for instantiating them based on criteria provided by the user. We created MLaut for the purpose of running supervised classification experiments but the toolbox also comes with estimators that can be used for supervised regression tasks.
From a software design perspective the most notable method in this class is the build method, which returns an instantiated estimator with the appropriate hyper-parameter search space and model architecture. In software design terms this approach most closely resembles the builder design pattern, which aims at separating the construction of an object from its representation. This design choice allows the base mlaut_estimator class to create different representations of machine learning models.
The mlaut_estimator object includes methods that complete its set of functionalities. Some of the main ones are a save method that takes into account the most appropriate format to persist a trained estimator object. This could include the pickle format used by most scikit-learn estimators or the HDF5 format used by keras. A load function is also available for restoring the saved estimators.
The design of the package also relies on the estimators having uniform fit and predict methods that take the same input data and generate predictions in the same format. These methods are not implemented at the mlaut_estimator level; instead, we relied on the fact that these fundamental methods will be uniform across the underlying packages. However, there is a discrepancy in the behaviour of the scikit-learn and keras estimators. For classification tasks keras requires the labels of the training data to be one hot encoded. Furthermore, the default behaviour of the keras predict method is equivalent to predict_proba in scikit-learn. We solved these discrepancies by overriding the fit and predict methods of the implemented keras estimators.
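The kind of override described here can be illustrated with a small adapter: labels are one-hot encoded before fitting, and class probabilities are mapped back to labels when predicting. The sketch below uses a stub in place of a neural network so that it runs without keras; `ProbabilisticModelStub` and `LabelInterfaceAdapter` are hypothetical names, and the code is an illustration of the idea rather than MLaut's implementation.

```python
class ProbabilisticModelStub:
    """Stand-in for a model that trains on one-hot labels and whose
    predict method returns class probabilities."""

    def fit(self, X, Y_one_hot):
        # Trivial "training": remember the empirical class frequencies.
        n = len(Y_one_hot)
        self.class_probs_ = [sum(col) / n for col in zip(*Y_one_hot)]
        return self

    def predict(self, X):
        return [list(self.class_probs_) for _ in X]


class LabelInterfaceAdapter:
    """Gives the wrapped model a fit/predict interface on plain labels."""

    def __init__(self, model):
        self.model = model

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        index = {label: j for j, label in enumerate(self.classes_)}
        # One-hot encode the labels before handing them to the model.
        one_hot = [
            [1.0 if index[label] == j else 0.0
             for j in range(len(self.classes_))]
            for label in y
        ]
        self.model.fit(X, one_hot)
        return self

    def predict_proba(self, X):
        return self.model.predict(X)

    def predict(self, X):
        # Map each probability row back to the most probable class label.
        return [
            self.classes_[max(range(len(row)), key=row.__getitem__)]
            for row in self.predict_proba(X)
        ]


adapter = LabelInterfaceAdapter(ProbabilisticModelStub())
adapter.fit([[0.0], [1.0], [2.0]], ["cat", "dog", "dog"])
labels = adapter.predict([[0.5], [1.5]])
```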
Through the use of decorators and by implementing the build method we are able to fully customize the estimator object with minimal required programming. The decorator class allows setting the metadata associated with the estimator. This includes setting the name, estimator family, types of tasks and hyper-parameters. This together with an implemented build method will give the user a fully specified machine learning model. This approach also facilitates the application of the algorithms and the use of the software as we can ensure that each algorithm is matched to the correct datasets. Furthermore, this makes it easy to retrieve the required algorithms by executing a simple command.
Closely following terminology and taxonomy of , mlaut estimators are currently assigned to one of the following methodological families:
Baseline Estimators. This family of models is also referred to as a dummy estimator and serves as a benchmark to compare other models to. It does not aim to learn any representation of the data but simply adopts a strategy of guessing.
Generalized Linear Model Estimators. A family of models that assumes that a (generalized) linear relationship exists between the feature variables and the target values.
Prototype Method Estimators. Family of models that apply prototype matching techniques for fitting the data. The most prominent member of this family is the K-means algorithm.
Kernel Method Estimators. Family of models using kernelization techniques, including support vector machine based estimators.
Deep Learning and Neural Network Estimators. This family of models provides implementation of neural network models, including deep neural networks.
Ensembles-of-Trees Estimators. Family of methods that combines the predictions of several tree-based estimators in order to produce a more robust overall estimator. This family is further divided in:
averaging methods. The models in this group average the predictions of several independent models in order to arrive at a combined estimator. An example is Breiman’s random forest.
boosting methods. An ensembling approach of building models sequentially based on iterative weighted residual fitting. An example are stochastic gradient boosted tree models.
In addition to this, the user also has the option to write their own estimator objects. In order to achieve this, the new class needs to inherit from the mlaut_estimator class and implement the abstract methods in each child class. The main abstract method that needs to be implemented is the build method, which returns a wrapped instance of the estimator with a set of hyper-parameters that will be used in the tuning process. For further details about the implemented estimators refer to 5.4.
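The pattern of inheriting from a base class and implementing a build method can be sketched as follows. To keep the example self-contained, a hypothetical `EstimatorSketchBase` stands in for the real base class, and `TunableWrapper` is an illustrative wrapper that pairs a model class with a hyper-parameter grid; none of these names are taken from MLaut.

```python
from abc import ABC, abstractmethod
from itertools import product


class EstimatorSketchBase(ABC):
    """Hypothetical stand-in for an estimator base class."""

    @abstractmethod
    def build(self):
        """Return a wrapped model together with its hyper-parameter grid."""


class TunableWrapper:
    """Hypothetical wrapper pairing a model class with a parameter grid."""

    def __init__(self, model_class, param_grid):
        self.model_class = model_class
        self.param_grid = param_grid

    def candidates(self):
        # Enumerate one model instance per hyper-parameter combination.
        names = sorted(self.param_grid)
        for values in product(*(self.param_grid[name] for name in names)):
            yield self.model_class(**dict(zip(names, values)))


class ShrunkenMeanModel:
    """Toy regressor: predicts the training mean shrunk towards zero."""

    def __init__(self, shrinkage=0.0):
        self.shrinkage = shrinkage

    def fit(self, X, y):
        self.mean_ = (1.0 - self.shrinkage) * sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class ShrunkenMeanEstimator(EstimatorSketchBase):
    """User-defined estimator: only build needs to be implemented."""

    def build(self):
        return TunableWrapper(ShrunkenMeanModel, {"shrinkage": [0.0, 0.5]})


wrapped = ShrunkenMeanEstimator().build()
models = [m.fit([[0], [1]], [2.0, 4.0]) for m in wrapped.candidates()]
```

Separating build from the model class is what lets the same estimator definition produce differently configured models for different datasets.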
4.4.3 Experiments Module
This module contains the logic for orchestration of the machine learning experiments. The main parameters in this module are the datasets and the estimator models that will be trained on the data. The main run method of the module then proceeds to training all estimators on all datasets, sequentially. The core of the method consists of two nested for loops, the first of which iterates over the datasets and the second one over the estimators. Inside the inner loop the orchestrator class builds an estimator instance for each dataset. This allows the machine learning model to be tailored to each dataset. For example, the architecture of a deep neural network can be altered to include the appropriate number of neurons based on the input dimensions of the data. This module is also responsible for saving the trained estimators and making predictions. It should be noted that the orchestrator module is not responsible for the parallelization of the experiments, which is handled on an individual estimator level.
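The nested loop structure, including per-dataset model building and skipping of already completed runs, can be sketched as below. The function `run_experiment`, the `completed` set used to represent partial run state, and the toy `MeanRegressor` are assumptions for illustration; persistence to disk and prediction are omitted.

```python
class MeanRegressor:
    """Toy estimator built per dataset; records the input dimension."""

    def __init__(self, n_features):
        self.n_features = n_features

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self


def run_experiment(datasets, estimator_builders, completed=None):
    """Fit every estimator on every dataset, skipping completed pairs.

    datasets: dict mapping dataset name -> (X_train, y_train)
    estimator_builders: dict mapping estimator name -> callable(n_features)
    completed: set of (dataset name, estimator name) pairs already done
    """
    completed = set() if completed is None else completed
    fitted = {}
    for dataset_name, (X_train, y_train) in datasets.items():
        n_features = len(X_train[0])
        for estimator_name, build in estimator_builders.items():
            key = (dataset_name, estimator_name)
            if key in completed:
                continue  # finished in an earlier, interrupted run
            # Build a fresh model tailored to this dataset's input dimension.
            model = build(n_features)
            fitted[key] = model.fit(X_train, y_train)
            completed.add(key)
    return fitted


datasets = {
    "d1": ([[1.0, 2.0], [3.0, 4.0]], [1.0, 3.0]),
    "d2": ([[1.0], [2.0], [3.0]], [2.0, 2.0, 5.0]),
}
builders = {"mean": MeanRegressor}
# Resume a run in which ("d1", "mean") had already been completed.
resumed = run_experiment(datasets, builders, completed={("d1", "mean")})
```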
4.4.4 Result Analysis Module
This module includes the logic for performing the quantitative evaluation and comparison of the machine learning strategies’ performance. The predictions of the trained estimators on the test sets for each dataset serve as input. First, performances and, if applicable, standard errors on the individual data sets are computed, for a given average or aggregate loss/performance quantifier. The samples of performances are then used as inputs for comparative quantification.
API-wise, the framework for assessing the performance of the machine learning estimators hinges on three main classes. The analyze_results class implements the calculation of the quantifiers. Through composition, this class relies on the losses class, which performs the actual calculation of the prediction performances over the individual test sets. The third main class that completes the framework design is the scores class. It defines the loss/quantifier function that is used for assessing the predictive power of the estimators. An instance of the scores class is passed as an argument to the losses class.
We believe that this design choice of using three classes is required to provide the necessary flexibility for the composite performance quantifiers as described in Section 3 - i.e., to allow ranks to be computed for an arbitrarily chosen loss (e.g., mean rank with respect to mean absolute error), or to perform comparison testing using an arbitrarily chosen performance quantifier (e.g., Wilcoxon signed rank test comparing F1-scores).
Our API also facilitates user custom extension, e.g., for users who wish to add a new score function, an efficient way to compute aggregate scores or standard errors, or a new comparison testing methodology. For example, adding new score functions can be easily achieved by inheriting from the MLautScore abstract base class. On the other hand, the losses class completely encapsulates the logic for the calculation of the predictive performance of the estimators. This is particularly useful as the class internally implements a mini orchestrator procedure for calculating and presenting the loss achieved by all estimators supplied as inputs. Lastly, the suite of statistical tests available in MLaut can be easily expanded by adding the appropriate method to the analyze_results class or a descendant.
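To make the composition of score and losses objects more concrete, a minimal sketch is given below. The class names ScoreBase, ScoreAccuracy and Losses, and the method names calculate, evaluate and average, are hypothetical stand-ins chosen for this illustration; they are not claimed to match the MLaut classes.

```python
from abc import ABC, abstractmethod


class ScoreBase(ABC):
    """Hypothetical stand-in for an abstract score base class."""

    @abstractmethod
    def calculate(self, y_true, y_pred):
        """Return the score of predictions y_pred against labels y_true."""


class ScoreAccuracy(ScoreBase):
    """New score function, added by inheriting from the base class."""

    def calculate(self, y_true, y_pred):
        hits = sum(t == p for t, p in zip(y_true, y_pred))
        return hits / len(y_true)


class Losses:
    """Evaluates a given score for every estimator on every test set."""

    def __init__(self, score):
        self._score = score  # the score object is injected (composition)
        self._per_dataset = {}

    def evaluate(self, dataset_name, y_true, predictions):
        # predictions: dict mapping estimator name -> predicted labels
        for estimator_name, y_pred in predictions.items():
            value = self._score.calculate(y_true, y_pred)
            self._per_dataset.setdefault(estimator_name, {})[dataset_name] = value

    def average(self):
        """Average score per estimator over all evaluated datasets."""
        return {name: sum(scores.values()) / len(scores)
                for name, scores in self._per_dataset.items()}


losses = Losses(ScoreAccuracy())
losses.evaluate("d1", [0, 1, 1, 0], {"A": [0, 1, 1, 1], "B": [0, 0, 0, 0]})
losses.evaluate("d2", [1, 1], {"A": [1, 1], "B": [0, 1]})
average_scores = losses.average()
```

Swapping in a different score only requires passing another ScoreBase subclass to Losses; the looping logic stays untouched.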
Mathematical details of the quantification procedures implemented in MLaut were presented in Section 3. Usage details are illustrated in the workflow example of Section 4.5.
In this implementation of MLaut we use third-party packages for performing the statistical tests. We rely mostly on the scikit-learn package. However, for post hoc tests we use the scikit-posthocs package and the Orange package, the latter of which we also used for creating critical distance graphs for comparing multiple classifiers.
4.4.5 Shared Module
This module includes classes and methods that are shared by the other modules in the package. The Files_IO class comprises all methods for manipulating files and datasets. This includes saving/loading of trained estimators from the HDD and manipulating the HDF5 databases. The Shared module also keeps all static variables that are used throughout the package.
4.5 Workflow Example
We give a step-by-step overview of the most basic variant of the user workflow. Advanced examples with custom estimators and set-ups may be found in the MLaut tutorial.
Step 0: setting up the data set collection
The user should begin by setting up the data set collection via the Files_IO class. Meta-data for each dataset needs to be provided that includes as a minimum the class column name/target attribute and name of dataset. This needs to be done once for every dataset collection, and may not need to be done for a pre-existing or pre-deployed collection. Currently, only local HDF5 data bases are supported.
We have implemented back-end set-up routines which download specific data set collections and generate the meta-data automatically. Current support includes the UCI library data sets and OpenML. Alternatively, the back-end may be populated directly by storing an in-memory pandas DataFrame via the save_pandas_dataset method, e.g., as part of custom loading scripts.
In this case, meta-data for the individual datasets needs to be provided in the following dictionary format:
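A minimal sketch of what such a meta-data dictionary might look like is given below. The key names class_name and dataset_name are assumptions made for illustration of the two required fields named above (target column and dataset name), and the helper check_metadata is hypothetical; the keys actually expected by save_pandas_dataset should be taken from the MLaut documentation.

```python
REQUIRED_KEYS = ("class_name", "dataset_name")  # assumed key names


def check_metadata(metadata):
    """Raise ValueError if a required meta-data field is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not metadata.get(key)]
    if missing:
        raise ValueError(f"missing meta-data fields: {missing}")
    return metadata


iris_metadata = check_metadata({
    "class_name": "species",   # name of the target column
    "dataset_name": "iris",    # name under which the dataset is stored
})
```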
Step 1: initializing data and output locations
As the next step, the user should specify the back-end links to the data set collections (“input”) and to intermediate or analysis results (“output”). This is done via the data class. It is helpful for code readability to store these in input_io and out_io variables.
These may then be supplied as parameters to preparation and orchestration routines. We then proceed to obtain the paths to the raw datasets as well as the respective train/test splits, which is done through the list_datasets and split_datasets methods, respectively.
Step 2: initializing estimators
The next step is to instantiate the learning strategies, estimators in sklearn terminology, which we want to use in the benchmarking exercise. The most basic and fully automated variant is use of the instantiate_default_estimators method which loads a pre-defined set of defaults given specified criteria. Currently, only a simple string look-up via the estimators parameter is implemented, but we plan to extend the search/matching functionality. The string criterion may be used to fetch specific estimators by a list of names, entire families of models, estimators by task (e.g., classification), or simply all available estimators.
Step 3: orchestrating the experiment
The final step is to run the experiment by passing references to data and estimators to the orchestrator class, then initiating the training process by invoking its run method.
Step 4: computing benchmark quantifiers
After the estimators are trained and the predictions of the estimators are recorded we can proceed to obtaining quantitative benchmark results for the experiments.
For this, we need to instantiate the AnalyseResults class by supplying the folders where the raw datasets and predictions are stored. Its prediction_errors method may be invoked to return both the prediction performance quantifiers per estimator and the prediction performances per estimator and per dataset.
The prediction errors per dataset per estimator can be directly examined by the user. On the other hand, the estimator performances may be used as further inputs for comparative quantification via hypothesis tests. For example, we can perform a paired t-test for pairwise comparison of methods by invoking the code below:
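To illustrate the computation underlying such a pairwise comparison, independently of the MLaut interface, the following sketch computes the paired t statistic for two strategies from their per-dataset losses. The function paired_t_statistic and the loss values are illustrative assumptions; this is not the MLaut call. The p-value follows from a Student t distribution with the returned degrees of freedom, which a library routine such as scipy.stats.ttest_rel computes directly.

```python
import math
import statistics


def paired_t_statistic(losses_a, losses_b):
    """Paired t statistic and degrees of freedom for two loss samples.

    losses_a[i] and losses_b[i] must be the losses of the two strategies
    on the same (i-th) dataset.
    """
    if len(losses_a) != len(losses_b):
        raise ValueError("samples must be paired, i.e., of equal length")
    differences = [a - b for a, b in zip(losses_a, losses_b)]
    n = len(differences)
    mean_difference = statistics.fmean(differences)
    # Sample standard deviation of the differences (n - 1 denominator).
    standard_error = statistics.stdev(differences) / math.sqrt(n)
    return mean_difference / standard_error, n - 1


# Illustrative misclassification losses of two strategies on four datasets.
t_value, dof = paired_t_statistic([0.10, 0.20, 0.30, 0.25],
                                  [0.15, 0.22, 0.35, 0.30])
```

A negative statistic here indicates that the first strategy has the lower average loss on the paired datasets.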
5 Using MLaut to Compare the Performance of Classification Algorithms
As a major test use case for MLaut, we conducted a large-scale benchmark experiment comparing a selection of off-shelf classifiers on datasets from the UCI Machine Learning Repository. Our study had four main aims:
stress testing the MLaut framework on scale, and observing the user interaction workflow in a major test case.
replicating the key points of the experimental set-up by Fernández-Delgado et al., while avoiding their severe mistake of tuning on the test set.
including deep learning methodology in the experiment.
Given the above, the below benchmarking study is, to the best of our knowledge, the first large-scale supervised classification study which (in the disjunctive sense: i.e., to the best of our knowledge, the first large-scale benchmarking study which does any of the following, rather than being only the first study to do all of the following):
is correctly conducted via out-of-sample evaluation and comparison. This is because Fernández-Delgado et al. commit the mistake of tuning on the test set, as is even acknowledged in their own Section 3, Results and Discussion.
includes contemporary deep neural network classification approaches, and is conducted on a broad selection of classification data sets which is not specific to a special domain such as image classification (the UCI dataset collection).
We intend to extend the experiment in the future by including further dataset collections and learning strategies.
Full code for our experiments, including random seeds, can be found as a jupyter notebook in MLaut’s documentation.
5.1 Hardware and software set-up
The benchmark experiment was conducted on a Microsoft Azure VM with 16 CPU cores and 32 GB of RAM, using our Docker virtualized implementation of MLaut. The experiments ran for about 8 days. MLaut requires Python 3.6 and should either be installed in a dedicated virtual environment in order to avoid conflicts, or be used via the Docker implementation. The full code for running the experiments and the code for generating the results in Appendix A can be found in the examples directory in the GitHub repository of the project.
5.2 Experimental set-up
5.2.1 Data set collection
The benchmarking study uses the same dataset collection as employed by Fernández-Delgado et al. This collection consists of 121 tabular datasets for supervised classification, taken directly from the UCI machine learning repository. Prior to the experiment, each dataset was standardized, such that each individual feature variable has a mean of 0 and a standard deviation of 1.
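The standardization step can be sketched as follows. This is an illustrative, dependency-free version under two assumptions that the text does not specify: the population standard deviation is used (as in common library scalers), and constant features are mapped to zero to avoid division by zero. The function name standardize_columns is hypothetical.

```python
import statistics


def standardize_columns(rows):
    """Rescale each feature column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    scaled_columns = []
    for column in columns:
        mean = statistics.fmean(column)
        std = statistics.pstdev(column)  # population std (an assumption)
        if std == 0:
            # Constant feature: centre it, but avoid dividing by zero.
            scaled_columns.append([0.0 for _ in column])
        else:
            scaled_columns.append([(value - mean) / std for value in column])
    return [list(row) for row in zip(*scaled_columns)]


scaled = standardize_columns([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
```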
The dataset collection of Fernández-Delgado et al. is intended to be representative of a wide scope of basic real-world classification problems. It should be noted that this representative cross-section of simple classification tasks excludes more specialized tasks such as image, audio, or text/document classification, which are usually regarded as typical applications of deep learning, and for which deep learning is also the contemporary state-of-art. For a detailed description of the Fernández-Delgado et al.  data collection, see Section 2.1 there.
5.2.2 Re-sampling for evaluation
Each dataset is split into exactly one pair of training and test sets. The training sets are selected, for each data set, uniformly at random (independently for each dataset in the collection) as a (rounded) fraction of the available data sample; the remaining samples in the dataset form the test set on which the strategies are asked to make predictions. Random seeds and the indices of the exact splits were saved to ensure reproducibility and post-hoc scrutiny of the experiments.
The training set may (or may not) be further split by the contender methods for tuning - as stated previously in Section 2, this is not enforced as part of the experimental set-up (unlike in the set-up of Fernández-Delgado et al., which, on top of doing so, is also faulty), but is left to each learning strategy to deal with internally, and will be discussed in the next section. In particular, none of the strategies have access to the test set for tuning or training.
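A reproducible single train/test split of this kind can be sketched as below. The function split_indices is hypothetical, and the training fraction and seed used in the call are placeholders chosen only for illustration; they are not the values used in the study.

```python
import random


def split_indices(n_samples, train_fraction, seed):
    """Return (train_indices, test_indices), sampled uniformly at random."""
    rng = random.Random(seed)  # dedicated, seeded generator
    n_train = round(train_fraction * n_samples)
    shuffled = list(range(n_samples))
    rng.shuffle(shuffled)
    return sorted(shuffled[:n_train]), sorted(shuffled[n_train:])


# Placeholder values: neither the fraction nor the seed is from the study.
train_idx, test_idx = split_indices(n_samples=10, train_fraction=0.7, seed=0)
```

Storing the seed together with the returned index lists is what allows the exact split to be re-created and inspected later.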
5.3 Evaluation and comparison
We largely followed the procedure suggested by  for the analysis of the performance of the trained estimators. For all classification strategies, the following performance quantifiers are computed per dataset:
rank of misclassification loss
Averages of these are computed, with standard errors for future data situation (c: re-trained, on unseen dataset). In addition, for the misclassification loss on each data set, standard errors for future data situation (a: re-used, same dataset) are computed.
The following pairwise comparisons between samples of performances by dataset are computed:
paired t-test on misclassification losses, with Bonferroni correction
(paired) Wilcoxon signed rank on misclassification losses, with Bonferroni correction
Friedman test on ranks, with Neményi’s significant rank differences and post-hoc significances
Detailed descriptions of these may be found in Section 3.
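Two of the building blocks used above, per-dataset ranks of losses averaged over datasets and the Bonferroni correction of p-values, can be sketched in plain Python as follows. The function names and the example numbers are illustrative assumptions and do not reproduce MLaut's implementation; ties are assigned the mean of the tied ranks.

```python
def rank_losses(losses):
    """Rank losses in ascending order (1 = best); ties share the mean rank."""
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    ranks = [0.0] * len(losses)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and losses[order[end + 1]] == losses[order[start]]):
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of positions start..end
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def average_ranks(loss_table):
    """loss_table[d][s] is the loss of strategy s on dataset d."""
    per_dataset = [rank_losses(row) for row in loss_table]
    n_datasets = len(per_dataset)
    return [sum(column) / n_datasets for column in zip(*per_dataset)]


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    return [min(1.0, p * len(p_values)) for p in p_values]


mean_ranks = average_ranks([[0.10, 0.20, 0.20],
                            [0.30, 0.10, 0.20]])
adjusted = bonferroni([0.01, 0.04, 0.5])
```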
5.4 Benchmarked machine learning strategies
Our choice of classification strategies is not exhaustive, but is meant to be representative of off-shelf choices in the scikit-learn and keras packages. We intend to extend the selection in future iterations of this study.
From scikit-learn, the suite of standard off-shelf approaches includes linear models, Naive Bayes, SVM, ensemble methods, and prototype methods.
We used keras to construct a number of neural network architectures representative of the state-of-art. This proved a challenging task due to the lack of explicitly recommended architectures for simple supervised classification to be found in literature.
5.4.1 Tuning of estimators
It is important to note that the off-shelf choices and their default parameter settings are often not considered good or state-of-art: hyper-parameters in scikit-learn are by default not tuned, and there are no default keras architectures that come with the package.
For scikit-learn classifiers, we tune parameters using scikit-learn’s GridSearchCV wrapper-compositor (which never looks at the test set by construction).
In all cases of tuned methods, parameter selection in the inner tuning loop is done via grid tuning by 5-fold cross-validation, with respect to the default score function implemented at the estimator level. For classifiers as in our study, the default tuning score is mean accuracy (averaged over all 5 tuning test folds in the inner cross-validation tuning loop), which is equivalent to tuning by mean misclassification loss.
The tuning grids will be specified in Section 5.4.2 below.
For keras classifiers, we built architectures by interpolating general best practice recommendations in scientific literature, as well as based on concrete designs found in software documentation or unpublished case studies circulating on the web. We further followed the sensible default choices of keras whenever possible.
The specific choices for neural network architecture and hyper-parameters are specified in Section 5.4.3 below.
5.4.2 Off-shelf scikit-learn supervised strategies
Algorithms that do not have any tunable hyperparameters
|Estimator name||sklearn.dummy.DummyClassifier|
|Description||This classifier is a naive/uninformed baseline and always predicts the most frequent class in the training set (“majority class”). This corresponds to the choice of the most_frequent parameter.|
|Hyperparameters||None|
|Estimator name||sklearn.naive_bayes.BernoulliNB|
|Description||Naive Bayes classifier for multivariate Bernoulli models. This classifier assumes that all features are binary; if not, they are converted to binary. For reference please see  Chapter 2.|
|Hyperparameters||None|
|Estimator name||sklearn.naive_bayes.GaussianNB|
|Description||Standard implementation of the Naive Bayes algorithm with the assumption that the features are Gaussian. For reference please see  Chapter 2.|
|Hyperparameters||None|
|Estimator name||sklearn.linear_model.PassiveAggressiveClassifier|
|Description||Part of the online learning family of models based on the hinge loss function. This algorithm observes feature-value pairs in a sequential manner. After each observation the algorithm makes a prediction, checks the correct value and calibrates the weights. For further reference see .|
|Hyperparameters||C: array of 13 equally spaced numbers on a log scale in the range , scikit-learn default: 1|
|Estimator name||sklearn.neighbors.KNeighborsClassifier|
|Description||The algorithm uses a majority vote of the nearest neighbours of each data point to make a classification decision. For reference see  and , Chapter 2.|
|Hyperparameters||n_neighbors=[1;30], scikit-learn default: 5. p=[1,2], scikit-learn default: 2.|
|Estimator name||sklearn.svm.SVC|
|Description||This estimator is part of the Support Vector family of algorithms. In this study, we use the Gaussian kernel only. For reference see  and , Chapter 7. The performance of support vector machines is very sensitive with respect to the tuning parameters:|
C, the regularization parameter. There does not seem to be a consensus in the community regarding the search space for the C hyper-parameter. In one example (at the time of writing available at http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html), the scikit-learn documentation refers to an initial hyper-parameter search space for C in the range . However, a different example (at the time of writing available at http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html) suggests . A third scikit-learn example (at the time of writing available at https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html#sphx-glr-auto-examples-model-selection-plot-grid-search-digits-py) suggests testing both the linear and rbf kernels and broad values for the C and gamma parameters. Other researchers  suggest searching for C in the range , which we used in our study as it provides a good compromise between reasonable running time and comprehensiveness of the search space.
gamma, the inverse kernel bandwidth. The scikit-learn example (at the time of writing available at http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html) suggests a hyper-parameter search space for gamma in the range . However, a second scikit-learn example (at the time of writing available at http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html) suggests searching only in . On the other hand,  suggest searching for gamma in the range , which again we found to be the middle ground and applied in our study.
|Hyperparameters||C: array of 13 equally spaced numbers on a log scale in the range , scikit-learn default: 1. gamma: array of 13 equally spaced numbers on a log scale in the range , scikit-learn default: auto|
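A grid of 13 values that are equally spaced on a log scale can be generated as in the following sketch. The function log_spaced_grid is hypothetical, and the exponents and base in the example call are placeholders for illustration only, not the search ranges used in the study. With NumPy available, numpy.logspace produces the same kind of grid.

```python
def log_spaced_grid(low_exponent, high_exponent, n_points=13, base=10.0):
    """n_points values base**e, with e equally spaced in [low, high]."""
    step = (high_exponent - low_exponent) / (n_points - 1)
    return [base ** (low_exponent + i * step) for i in range(n_points)]


# Placeholder exponents, chosen for illustration only.
example_grid = log_spaced_grid(-3, 3)
```

Consecutive grid values differ by a constant factor, so the grid covers several orders of magnitude with few candidate values.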
The three main models from this family that we used in this study are RandomForest, Bagging and Boosting. All three are built around the logic of combining the predictions of a large number of weak estimators, such as decision trees. As such, they share many of the same hyperparameters. Some of the main parameters for this family of models are the number of estimators, the maximum number of features and the maximum tree depth, for each of which default values are suggested in the scikit-learn package. Recent research  and informal consensus in the community suggest that deviating from the default parameters yields performance gains for the Boosting algorithm, but tends to bring only limited improvements for the RandomForest algorithm. As such, for the purposes of this study we focus our efforts on tuning the Boosting and Bagging algorithms, and use a relatively small parameter search space for tuning RandomForest.
|Estimator name||sklearn.ensemble.GradientBoostingClassifier|
|Description||Part of the ensemble meta-estimators family of models. We used the default sklearn deviance loss. The algorithm fits a series of decision trees on the data and predictions are made based on a majority vote. At each iteration the data is modified by applying weights to it and predictions are made again. At each iteration the weights of the incorrectly predicted pairs are increased (boosted) and those of the correctly predicted pairs are decreased. As per the scikit-learn documentation , this estimator is not recommended for datasets with more than two classes, as it requires the introduction of regression trees at each iteration. The suggested approach is to use the RandomForest algorithm instead. Many of the datasets used in this study are multiclass supervised learning problems. However, for the purposes of this study we use the Gradient Boosting algorithm in order to see how it performs when benchmarked against the suggested approach. For reference see  and , Chapter 17.|
|Hyperparameters||number of estimators: , scikit-learn default: 100. max depth: integers in the range , scikit-learn default: 3.|
|Estimator name||sklearn.ensemble.RandomForestClassifier|
|Description||Part of the ensemble meta-estimators family of models. The algorithm fits decision trees on sub-samples of the dataset. The average voting rule is used for making predictions. For reference see  and , Chapter 17.|
|Hyperparameters||For this study we used the following hyperparameter grid: number of estimators: , scikit-learn default: 10. max features: [auto, sqrt, log2, None], scikit-learn default: auto. max depth: , scikit-learn default: None.|
|Estimator name||sklearn.ensemble.BaggingClassifier|
|Description||Part of the ensemble meta-estimators family of models. The algorithm draws with replacement feature-label pairs , trains decision tree base estimators and makes predictions based on voting or averaging rules. For reference see  and , Chapter 17.|
|Hyperparameters||number of estimators: , scikit-learn default: 10|
5.4.3 Keras neural network architectures including deep neural networks
We briefly summarize our choices for hyper-parameters and architecture.
Architecture. Efforts have been made to make the choice of architectures less arbitrary by suggesting algorithms for finding the optimal neural network architecture . Other researchers have suggested good starting points and best practices that one should adhere to when devising a network architecture [25, 38]. We also followed the guidelines of , in particular Chapter 6.4. The authors conducted an empirical study and showed that the accuracy of networks increases as the number of layers grows, but that the gains diminish rapidly beyond 5 layers. These findings are also confirmed by other studies  that question the need to use very deep feed-forward networks. In general, the consensus in the community seems to be that 2-4 hidden layers are sufficient for most feed-forward network architectures. One notable exception to this rule seems to be convolutional network architectures, which have been shown to perform best when several sequential layers are stacked one after the other. However, this study does not make use of convolutional neural networks, as our data is not suitable for these models, in particular because there is no well-specified way to transform samples into a multidimensional array form. The architectures are given below as their keras code specification.
Regularization. We employ the current state-of-art in neural network regularization: dropout. In the absence of clear rules when and where dropout should be applied, we include two versions of each neural network in the study: one version not using dropout, and one using dropout. Dropout regularization is as described by Hinton et al. , Srivastava et al.  where its potential for improving the generalization accuracy of neural networks is shown. We used a dropout rate of 0.5 as suggested by the authors.
Hyper-parameter tuning. We did not perform grid search to find the optimal hyper-parameters for the network. The reason for this is two-fold. We interfaced the neural network models from keras. The keras interface is not fully compatible with scikit-learn’s GridSearch, nor does it provide easy off-shelf tuning facilities (see subsection 4.4.2 for details). Furthermore, using grid search tuning does not seem to be considered common practice by the community, and some researchers  even actively recommend against it; hence it might not be considered a fair representation of the state-of-art. Instead, the prevalent practice seems to be manual tuning of hyper-parameters based on learning curves. Following the latter in the absence of off-shelf automation, we manually tuned learning rate, batch size, and number of epochs by manual inspection of learning curves and performances on the full training sets (see below).
Learning Rate. The learning rate is one of the crucial hyper-parameter choices when training neural networks. The generally accepted rule to find the optimal rate is to start with a large rate and if the training process does not diverge decrease the learning rate by a factor of 3 . This approach is confirmed by  who also affirm that a larger learning rate can be used in conjunction with dropout without risking that the weights of the model blow out.
Batch Size. The datasets used in the study were relatively small and could fit in the memory of the machine that we used for training the algorithms. As a result we set the batch size to equal the entire dataset which is equivalent to full gradient descent.
Number of epochs. We performed manual hyper-parameter selection by inspection of individual learning curves for all combinations of learning rate and architecture. For this, learning curves on individual data sets’ training samples were inspected visually for the “plateau range” (range of minimal training error). For all architectures, and most data sets, the plateau was already reached for one single epoch, and training error usually tended to increase in the range of 50-500 epochs. The remaining, small number of datasets (most of which were of 4-or-above-digit sample size) plateaued in the 1-digit range.
While this is a very surprising finding as it corresponds to a single gradient descent step, it is what we found, while following what we consider the standard manual tuning steps for neural networks. We further discuss this in Section 5.6 and acknowledge that this surprising finding warrants further investigation, e.g., through checking for mistakes, or including neural networks tuned by automated schemes.
Thus, all neural network architectures were trained for one single epoch, since choosing a larger (and more intuitive) number of epochs would have been somewhat arbitrary, and not in concordance with the common manual tuning protocol.
For the keras models, we adopted six neural network architectures with varying depths and widths. Our literature review revealed that there is no consistent body of knowledge or concrete rules pertaining to constructing neural network models for simple supervised classification (as opposed to image recognition etc). Therefore, we extrapolated from general best practice guidelines as applicable to our study, and also included (shallow) network architectures that were previously used in benchmark studies. The full keras architectures of the neural networks used are listed below.
|Description||Own architecture of Deep Neural Network model applying the principles highlighted above. For this experiment we made use of the empirical evidence that networks of 3-4 layers were sufficient to learn any function discussed in . However, we opted for a slightly narrower network in order to investigate whether wider nets tend to perform better than narrow ones.|
|Hyperparameters||batch size: None, learning rate: , loss: mean squared error, optimizer: Adam, metrics: accuracy.|
|Description||In this architecture we experimented with the idea that wider networks perform better than narrower ones. No dropout was performed in order to test the idea that regularization is necessary for all deep neural network models.|
|Hyperparameters||batch size: None, learning rate: , loss: mean squared error, optimizer: Adam, metrics: accuracy.|
|Description||We tested the same architecture as above but applying dropout after the first two layers.|
|Hyperparameters||batch size: None, learning rate: , loss: mean squared error, optimizer: Adam, metrics: accuracy.|
|Description||Deep Neural Network model inspired by the architecture suggested by :|
|Hyperparameters||batch size: None, learning rate: , loss: mean squared error, optimizer: Adam, metrics: accuracy.|
|Description||Deep Neural Network model suggested in  with the following architecture:|
|Hyperparameters||batch size: None, learning rate: , loss: mean squared error, optimizer: Adam, metrics: accuracy.|
|Description||Deep Neural Network model suggested in  with the following architecture:|
|Hyperparameters||batch size: None, learning rate: , loss: mean squared error, optimizer: Adam, metrics: accuracy.|
Table 8 shows a summary overview of results.
|avg_rank||avg_score||std_error||avg training time (in sec)|
Figure 5 summarizes the samples of performances in terms of classification accuracy. Each sample consists of the performances of one method, ranging over data sets and averaged over the test sample within each dataset - i.e., the size of each sample of performances equals the number of data sets in the collection.
The Friedman test was significant at level p=2e-16. Figure 6 displays effect sizes, i.e., average ranks with Neményi’s post-hoc critical differences.
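For orientation, the critical difference of average ranks in the Neményi post-hoc test is usually computed as q_alpha * sqrt(k (k + 1) / (6 N)), where k is the number of strategies and N the number of datasets. The sketch below implements this formula. The function name is hypothetical, and the values passed in the example call (in particular q_alpha and the number of strategies) are illustrative inputs only, not the ones underlying Figure 6; q_alpha has to be looked up for the chosen significance level and number of strategies.

```python
import math


def nemenyi_critical_difference(q_alpha, n_strategies, n_datasets):
    """Critical difference of average ranks for the Nemenyi post-hoc test."""
    k, n = n_strategies, n_datasets
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))


# Illustrative call: q_alpha and n_strategies are placeholder inputs.
cd = nemenyi_critical_difference(q_alpha=2.728, n_strategies=5, n_datasets=121)
```

Two strategies whose average ranks differ by less than this value are not distinguished by the test at the chosen level.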
From all the above, the top five algorithms among the contenders were the Random Forest, SVC, Bagging, K Neighbours and Gradient Boosting classifiers.
Further benchmarking results may be found in the automatically generated Appendix A. These include results of paired t-tests and Wilcoxon signed rank tests. Briefly summarizing these: neither the t-test (Appendix A.1) nor the Wilcoxon signed rank test (Appendix A.2), with Bonferroni correction (adjacent strategies and all vs baseline), is in isolation able to reject the null hypothesis of no performance difference between any two of the top five performers.
We discuss our findings below, including a comparison with the benchmarking study by Fernández-Delgado et al. .
5.6.1 Key findings
In summary, the key findings of the benchmarking study are:
MLaut is capable of carrying out large-scale benchmarking experiments across a representative selection of off-shelf supervised learning strategies, including state-of-art deep learning models, and a selection of small-to-moderate-sized basic supervised learning benchmark data sets.
On the selection of benchmark data sets representative for basic (non-specialized) supervised learning, the best performing algorithms are ensembles of trees and kernel-based algorithms. Neural networks (deep or not) perform poorly in comparison.
Of the algorithms benchmarked, grid-tuned support vector classifiers are the most demanding of computation time. Neural networks (deep or not) and the other algorithms benchmarked require computation time of a comparable order of magnitude.
The main limitations of our study are:
restriction to the Delgado data set collection. Our study is at most as representative for the methods’ performance as the Delgado data set collection is for basic supervised learning.
training the neural networks for one epoch only. As described in 5.4.3 we believe we arrived at this choice following standard tuning protocol, but it requires further investigation, especially to rule out a mistake - or to corroborate evidence of a potential general issue of neural networks with basic supervised learning (i.e., not on image, audio, text data etc).
a relatively small set of prediction strategies. While our study is an initial proof-of-concept for MLaut on commonly used algorithms, it did not include composite strategies (e.g., full pipelines), or the full selection available in state-of-art packages.
5.6.3 Comparison to the study of Delgado et al
In comparison to the benchmarking study of Fernández-Delgado et al. , for most algorithms we find comparable performances which are within 95% confidence bands (of ours). A notable major departure is performance of the neural networks, which we find to be substantially worse. The latter finding may be plausibly explained by at least one of the following:
In addition, the general rankings (when disregarding the neural networks) are similar. However, since a replication of rankings depends on conducting the study on exactly the same set of strategies, we are only able to state this qualitatively. Conversely, our confidence intervals indicate that rankings in general are very unstable on the data set collection, as roughly half of the 179 classifiers which Fernández-Delgado et al.  benchmarked seem to be within 95% confidence ranges of each other.
This seems to highlight the crucial necessity of reporting not only performances but also confidence bands, if reasoning is to be conducted about which algorithmic strategies are the “best” performing ones.
Our findings corroborate most of the findings of the major existing benchmarking study of Fernández-Delgado et al. . In addition, we validate the usefulness of MLaut to easily conduct such a study.
As a notable exception to this confirmation of results, we find that neural networks do not perform well on “basic” supervised classification data sets. While this may be explained by a bias that Fernández-Delgado et al.  introduced into their study by the mistake of tuning on the test set, our finding stands under the strong caveat that further investigation needs to be carried out, in particular with respect to the tuning behaviour of said networks, and to confirm that our experiment does not contain other mistakes.
However, if further investigation confirms our findings, this would be consistent with one of the original dropout papers, in which the authors also conclude that the improvements are more noticeable on image datasets and less so on other types of data such as text. For example, the authors found that the performance improvements achieved on the Reuters RCV1 corpus were not significant in comparison with architectures that did not use dropout. Furthermore, at least in our study, we found no evidence to suggest that deep architectures performed better than shallow ones. In fact, the 12-layer deep neural network architecture ranked just slightly better than our baseline classifier. Our findings may also suggest that wide architectures tend to perform better than thin ones on our training data. It should also be pointed out that the datasets we used in this experiment were relatively small. Therefore, it could be argued that deep neural networks can easily overfit such data, and that the default parameter choices and standard procedures are not appropriate, especially since such common practice may arguably be strongly adapted to image/audio/text data.
In terms of training time, the SVC algorithm proved to be the most expensive, taking on average almost 30 minutes to train in our set-up. However, it should be noted that this is due to the relatively large hyper-parameter search space that we used. On the other hand, among the top five algorithms, the Bagging Classifier was one of the least expensive to train, taking an average of only 5 seconds. Our top performer, the Random Forest Classifier, was also relatively inexpensive to train, taking an average of only 14 seconds.
As our main finding, however, we consider the ease with which a user may generate the above results using MLaut. The reader may (hopefully) convince themselves of this by inspecting the code and jupyter notebooks in the repository. We are also very appreciative of any criticism, or suggestions for improvement, made (say, by an unconvinced reader) through the project’s issue tracker.
- scikit-learn laboratory. URL https://skll.readthedocs.io.
- Abadi et al.  Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
- Ba and Caruana  Lei Jimmy Ba and Rich Caruana. Do Deep Nets Really Need to be Deep? arXiv:1312.6184 [cs], 2013.
- Bengio  Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. arXiv:1206.5533 [cs], 2012.
- Bergstra and Bengio  James Bergstra and Yoshua Bengio. Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 2012.
- Bischl et al.  Bernd Bischl, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M. Jones. mlr: Machine Learning in R. Journal of Machine Learning Research, 17(170):1–5, 2016. URL http://www.jmlr.org/papers/v17/15-066.html.
- Bishop  Christopher Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, 2006. ISBN 978-1-4939-3843-8.
- Breiman  Leo Breiman. Bagging predictors. Machine Learning, 1996.
- Breiman  Leo Breiman. Random Forests. Machine Learning, 2001.
- Buitinck et al.  Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake Vanderplas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. API design for machine learning software: experiences from the scikit-learn project. CoRR, 2013.
- Chen et al.  Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. arXiv:1512.01274 [cs], 2015.
- Chollet  François Chollet. Keras, 2015. URL https://keras.io.
- Cortes and Vapnik  Corinna Cortes and Vladimir Vapnik. Support-Vector Networks. Machine Learning, 1995.
- Cover and Hart  Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 1967.
- Crammer et al.  Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. The Journal of Machine Learning Research, 2006.
- Demšar  Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 2006.
- Demšar et al.  Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, Marko Toplak, Anže Starič, Miha Štajdohar, Lan Umek, Lan Žagar, Jure Žbontar, Marinka Žitnik, and Blaž Zupan. Orange: Data Mining Toolbox in Python. Journal of Machine Learning Research, 2013.
- Efron and Hastie  Bradley Efron and Trevor Hastie. Computer Age Statistical Inference: Algorithms, Evidence and Data Science. Institute of Mathematical Statistics Monographs. Cambridge University Press, Cambridge, 2016.
- Fernández-Delgado et al.  Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? Journal of Machine Learning Research, 2014.
- Feurer et al.  Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, pages 2962–2970, 2015.
- Friedman  Jerome Friedman. Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 2001.
- Goodfellow et al.  Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
- Gupta and Raza  Tarun Kumar Gupta and Khalid Raza. Optimizing Deep Neural Network Architecture: A Tabu Search Based Approach. arXiv:1808.05979 [cs, stat], 2018.
- Hall et al.  Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. The weka data mining software: an update. ACM SIGKDD explorations newsletter, 11(1):10–18, 2009.
- Hasanpour et al.  Seyyed Hossein Hasanpour, Mohammad Rouhani, Mohsen Fayyaz, and Mohammad Sabokrou. Lets keep it simple, Using simple architectures to outperform deeper and more complex architectures. arXiv:1608.06037 [cs], 2016.
- Hinton et al.  Geoffrey Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580 [cs], 2012.
- Hsu et al.  Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A Practical Guide to Support Vector Classification. page 16, 2003.
- Huang et al.  Jin Huang, Jingjing Lu, and Charles Ling. Comparing naive Bayes, decision trees, and SVM with AUC and accuracy. IEEE Comput. Soc, 2003.
- Jagtap and Kodge  Sudhir Jagtap and Bheemashankar Kodge. Census Data Mining and Data Analysis using WEKA. arXiv:1310.4647 [cs], 2013.
- James et al.  Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. Introduction to Statistical Learning. Springer Publishing Company, Incorporated, 2013. ISBN 978-1-4614-7137-0.
- Jia et al.  Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding, 2014.
- Kazakov and Király  Viktor Kazakov and Franz Király. mlaut: Machine Learning automation toolbox, 2018. URL https://github.com/alan-turing-institute/mlaut.
- Krizhevsky et al.  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 2017.
- Ojha et al.  Varun Kumar Ojha, Ajith Abraham, and Václav Snášel. Metaheuristic Design of Feedforward Neural Networks: A Review of Two Decades of Research. Engineering Applications of Artificial Intelligence, 2017.
- Pedregosa et al.  Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 2011.
- Probst et al.  Philipp Probst, Bernd Bischl, and Anne-Laure Boulesteix. Tunability: Importance of Hyperparameters of Machine Learning Algorithms. arXiv:1802.09596 [stat], 2018.
- Ross  Sheldon M Ross. Introductory Statistics - 3rd Edition. Academic Press, 2010.
- Sansone and De Natale  Emanuele Sansone and Francesco G. B. De Natale. Training Feedforward Neural Networks with Standard Logistic Activations is Feasible. arXiv:1710.01013 [cs, stat], 2017.
- Scikit-Learn  Scikit-Learn. Model selection: choosing estimators and their parameters — scikit-learn 0.20.0 documentation, 2018.
- Seide and Agarwal  Frank Seide and Amit Agarwal. CNTK: Microsoft’s Open-Source Deep-Learning Toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 2135–2135, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2945397.
- Sonnenburg et al.  Sören Sonnenburg, Gunnar Rätsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien, Fabio de Bona, Alexander Binder, Christian Gehl, and Vojtěch Franc. The SHOGUN Machine Learning Toolbox. Journal of Machine Learning Research, pages 1799–1802, 2010.
- Srivastava et al.  Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 2014.
- scikit-posthocs: Statistical post-hoc analysis and outlier detection algorithms, 2018. URL http://github.com/maximtrp/scikit-posthocs.
- Thomas et al.  Janek Thomas, Stefan Coors, and Bernd Bischl. Automatic gradient boosting. arXiv preprint arXiv:1807.03873, 2018.
- Wainer  Jacques Wainer. Comparison of 14 different families of classification algorithms on 115 binary datasets. arXiv:1606.00930 [cs], 2016.
- Wilcoxon  Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1945.
- Wing et al.  Max Kuhn. Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, Michael Benesty, Reynald Lescarbeau, Andrew Ziem, Luca Scrucca, Yuan Tang, Can Candan, and Tyler Hunt. caret: Classification and Regression Training, 2018.
Appendix A Further benchmarking results
A.1 Paired t-test, without multiple testing correction
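For orientation, the following is a minimal sketch of a textbook paired t statistic between two strategies evaluated on the same units. It is a generic illustration and not necessarily the exact computation behind the values tabulated in this appendix; the function name and the scores are hypothetical. A p-value would follow from the Student t distribution with n - 1 degrees of freedom, for example via `scipy.stats.ttest_rel`.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples scores_a, scores_b."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    spread = stdev(diffs)  # sample std. dev. of the paired differences
    if spread == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (spread / math.sqrt(n)), n - 1


# Hypothetical paired scores, for illustration only (NOT results from the study).
scores_a = [0.90, 0.85, 0.78, 0.92, 0.88]
scores_b = [0.70, 0.60, 0.75, 0.80, 0.65]

t_ab, dof = paired_t_statistic(scores_a, scores_b)
t_ba, _ = paired_t_statistic(scores_b, scores_a)
print(f"t(A, B) = {t_ab:.3f}, t(B, A) = {t_ba:.3f}, degrees of freedom = {dof}")
```

Swapping the two strategies flips the sign of the statistic, which matches the antisymmetric pattern of the t_stat entries in the tables. No multiple testing correction is applied in this sketch either.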
| Strategy | BaggingClassifier t_stat | BaggingClassifier p_val | BaselineClassifier t_stat | BaselineClassifier p_val | BernoulliNaiveBayes t_stat | BernoulliNaiveBayes p_val | GaussianNaiveBayes t_stat | GaussianNaiveBayes p_val |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BaggingClassifier | 0.000 | 1.000 | 16.779 | 0.000 | 5.456 | 0.000 | 6.137 | 0.000 |
| BaselineClassifier | -16.779 | 0.000 | 0.000 | 1.000 | -11.764 | 0.000 | -9.373 | 0.000 |
| BernoulliNaiveBayes | -5.456 | 0.000 | 11.764 | 0.000 | 0.000 | 1.000 | 1.366 | 0.173 |
| GaussianNaiveBayes | -6.137 | 0.000 | 9.373 | 0.000 | -1.366 | 0.173 | 0.000 | 1.000 |
| GradientBoostingClassifier | -1.407 | 0.161 | 14.703 | 0.000 | 3.719 | 0.000 | 4.610 | 0.000 |
| K_Neighbours | -0.734 | 0.463 | 16.373 | 0.000 | 4.837 | 0.000 | 5.600 | 0.000 |
| NN-12-layer_wide_with_dropout | -10.631 | 0.000 | 3.963 | 0.000 | -6.273 | 0.000 | -4.629 | 0.000 |
| NN-12-layer_wide_with_dropout_lr01 | -13.058 | 0.000 | 2.045 | 0.042 | -8.562 | 0.000 | -6.687 | 0.000 |
| NN-12-layer_wide_with_dropout_lr1 | -13.606 | 0.000 | 1.309 | 0.192 | -9.183 | 0.000 | -7.296 | 0.000 |
| NN-2-layer-droput-input-layer_lr001 | -6.453 | 0.000 | 8.206 | 0.000 | -2.001 | 0.047 | -0.660 | 0.510 |
| NN-2-layer-droput-input-layer_lr01 | -10.186 | 0.000 | 4.101 | 0.000 | -5.927 | 0.000 | -4.347 | 0.000 |
| NN-2-layer-droput-input-layer_lr1 | -11.574 | 0.000 | 3.020 | 0.003 | -7.230 | 0.000 | -5.521 | 0.000 |
| NN-4-layer-droput-each-layer_lr0001 | -5.912 | 0.000 | 8.451 | 0.000 | -1.557 | 0.121 | -0.278 | 0.781 |
| NN-4-layer-droput-each-layer_lr01 | -12.754 | 0.000 | 2.109 | 0.036 | -8.336 | 0.000 | -6.514 | 0.000 |
| NN-4-layer-droput-each-layer_lr1 | -12.640 | 0.000 | 2.350 | 0.020 | -8.177 | 0.000 | -6.346 | 0.000 |
| NN-4-layer_thin_dropout | -6.405 | 0.000 | 8.003 | 0.000 | -2.042 | 0.042 | -0.723 | 0.471 |
| NN-4-layer_thin_dropout_lr01 | -12.112 | 0.000 | 2.299 | 0.022 | -7.839 | 0.000 | -6.117 | 0.000 |
| NN-4-layer_thin_dropout_lr1 | -13.293 | 0.000 | 1.411 | 0.159 | -8.937 | 0.000 | -7.100 | 0.000 |
| NN-4-layer_wide_no_dropout | -4.958 | 0.000 | 9.704 | 0.000 | -0.477 | 0.634 | 0.742 | 0.459 |
| NN-4-layer_wide_no_dropout_lr01 | -12.554 | 0.000 | 2.500 | 0.013 | -8.067 | 0.000 | -6.234 | 0.000 |
| NN-4-layer_wide_no_dropout_lr1 | -12.618 | 0.000 | 2.316 | 0.021 | -8.174 | 0.000 | -6.351 | 0.000 |
| NN-4-layer_wide_with_dropout | -5.043 | 0.000 | 9.661 | 0.000 | -0.548 | 0.584 | 0.680 | 0.497 |
| NN-4-layer_wide_with_dropout_lr01 | -13.170 | 0.000 | 1.892 | 0.060 | -8.690 | 0.000 | -6.813 | 0.000 |
| NN-4-layer_wide_with_dropout_lr1 | -12.877 | 0.000 | 2.118 | 0.035 | -8.416 | 0.000 | -6.568 | 0.000 |
| PassiveAggressiveClassifier | -2.876 | 0.004 | 13.497 | 0.000 | 2.313 | 0.022 | 3.371 | 0.001 |
| RandomForestClassifier | 0.253 | 0.800 | 16.752 | 0.000 | 5.597 | 0.000 | 6.262 | 0.000 |
| SVC | -0.607 | 0.544 | 15.781 | 0.000 | 4.660 | 0.000 | 5.443 | 0.000 |

| Strategy | GradientBoostingClassifier t_stat | GradientBoostingClassifier p_val | K_Neighbours t_stat | K_Neighbours p_val | NN-12-layer_wide_with_dropout t_stat | NN-12-layer_wide_with_dropout p_val | NN-12-layer_wide_with_dropout_lr01 t_stat | NN-12-layer_wide_with_dropout_lr01 p_val |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BaggingClassifier | 1.407 | 0.161 | 0.734 | 0.463 | 10.631 | 0.000 | 13.058 | 0.000 |
| BaselineClassifier | -14.703 | 0.000 | -16.373 | 0.000 | -3.963 | 0.000 | -2.045 | 0.042 |
| BernoulliNaiveBayes | -3.719 | 0.000 | -4.837 | 0.000 | 6.273 | 0.000 | 8.562 | 0.000 |
| GaussianNaiveBayes | -4.610 | 0.000 | -5.600 | 0.000 | 4.629 | 0.000 | 6.687 | 0.000 |
| GradientBoostingClassifier | 0.000 | 1.000 | -0.749 | 0.454 | 9.091 | 0.000 | 11.374 | 0.000 |
| K_Neighbours | 0.749 | 0.454 | 0.000 | 1.000 | 10.190 | 0.000 | 12.634 | 0.000 |
| NN-12-layer_wide_with_dropout | -9.091 | 0.000 | -10.190 | 0.000 | 0.000 | 1.000 | 1.837 | 0.068 |
| NN-12-layer_wide_with_dropout_lr01 | -11.374 | 0.000 | -12.634 | 0.000 | -1.837 | 0.068 | 0.000 | 1.000 |
| NN-12-layer_wide_with_dropout_lr1 | -11.936 | 0.000 | -13.193 | 0.000 | -2.470 | 0.014 | -0.665 | 0.507 |
| NN-2-layer-droput-input-layer_lr001 | -5.027 | 0.000 | -5.953 | 0.000 | 3.805 | 0.000 | 5.751 | 0.000 |
| NN-2-layer-droput-input-layer_lr01 | -8.702 | 0.000 | -9.748 | 0.000 | 0.190 | 0.850 | 2.002 | 0.046 |
| NN-2-layer-droput-input-layer_lr1 | -10.008 | 0.000 | -11.144 | 0.000 | -0.855 | 0.394 | 0.961 | 0.338 |
| NN-4-layer-droput-each-layer_lr0001 | -4.541 | 0.000 | -5.415 | 0.000 | 4.099 | 0.000 | 6.021 | 0.000 |
| NN-4-layer-droput-each-layer_lr01 | -11.115 | 0.000 | -12.332 | 0.000 | -1.734 | 0.084 | 0.084 | 0.933 |
| NN-4-layer-droput-each-layer_lr1 | -10.985 | 0.000 | -12.214 | 0.000 | -1.540 | 0.125 | 0.295 | 0.768 |
| NN-4-layer_thin_dropout | -5.013 | 0.000 | -5.914 | 0.000 | 3.686 | 0.000 | 5.602 | 0.000 |
| NN-4-layer_thin_dropout_lr01 | -10.559 | 0.000 | -11.693 | 0.000 | -1.474 | 0.142 | 0.310 | 0.757 |
| NN-4-layer_thin_dropout_lr1 | -11.665 | 0.000 | -12.881 | 0.000 | -2.338 | 0.020 | -0.549 | 0.584 |
| NN-4-layer_wide_no_dropout | -3.581 | 0.000 | -4.436 | 0.000 | 5.141 | 0.000 | 7.126 | 0.000 |
| NN-4-layer_wide_no_dropout_lr01 | -10.891 | 0.000 | -12.125 | 0.000 | -1.416 | 0.158 | 0.427 | 0.670 |
| NN-4-layer_wide_no_dropout_lr1 | -10.972 | 0.000 | -12.193 | 0.000 | -1.560 | 0.120 | 0.269 | 0.788 |
| NN-4-layer_wide_with_dropout | -3.658 | 0.000 | -4.520 | 0.000 | 5.091 | 0.000 | 7.079 | 0.000 |
| NN-4-layer_wide_with_dropout_lr01 | -11.489 | 0.000 | -12.748 | 0.000 | -1.968 | 0.050 | -0.138 | 0.891 |
| NN-4-layer_wide_with_dropout_lr1 | -11.215 | 0.000 | -12.454 | 0.000 | -1.751 | 0.081 | 0.079 | 0.937 |
| PassiveAggressiveClassifier | -1.370 | 0.172 | -2.238 | 0.026 | 7.982 | 0.000 | 10.251 | 0.000 |
| RandomForestClassifier | 1.618 | 0.107 | 0.976 | 0.330 | 10.700 | 0.000 | 13.096 | 0.000 |
| SVC | 0.791 | 0.430 | 0.086 | 0.931 | 9.919 | 0.000 | 12.269 | 0.000 |

| Strategy | NN-12-layer_wide_with_dropout_lr1 t_stat | NN-12-layer_wide_with_dropout_lr1 p_val | NN-2-layer-droput-input-layer_lr001 t_stat | NN-2-layer-droput-input-layer_lr001 p_val | NN-2-layer-droput-input-layer_lr01 t_stat | NN-2-layer-droput-input-layer_lr01 p_val | NN-2-layer-droput-input-layer_lr1 t_stat | NN-2-layer-droput-input-layer_lr1 p_val |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BaggingClassifier | 13.606 | 0.000 | 6.453 | 0.000 | 10.186 | 0.000 | 11.574 | 0.000 |
| BaselineClassifier | -1.309 | 0.192 | -8.206 | 0.000 | -4.101 | 0.000 | -3.020 | 0.003 |
| BernoulliNaiveBayes | 9.183 | 0.000 | 2.001 | 0.047 | 5.927 | 0.000 | 7.230 | 0.000 |
| GaussianNaiveBayes | 7.296 | 0.000 | 0.660 | 0.510 | 4.347 | 0.000 | 5.521 | 0.000 |
| GradientBoostingClassifier | 11.936 | 0.000 | 5.027 | 0.000 | 8.702 | 0.000 | 10.008 | 0.000 |
| K_Neighbours | 13.193 | 0.000 | 5.953 | 0.000 | 9.748 | 0.000 | 11.144 | 0.000 |
| NN-12-layer_wide_with_dropout | 2.470 | 0.014 | -3.805 | 0.000 | -0.190 | 0.850 | 0.855 | 0.394 |
| NN-12-layer_wide_with_dropout_lr01 | 0.665 | 0.507 | -5.751 | 0.000 | -2.002 | 0.046 | -0.961 | 0.338 |
| NN-12-layer_wide_with_dropout_lr1 | 0.000 | 1.000 | -6.349 | 0.000 | -2.624 | 0.009 | -1.602 | 0.111 |
| NN-2-layer-droput-input-layer_lr001 | 6.349 | 0.000 | 0.000 | 1.000 | 3.552 | 0.000 | 4.663 | 0.000 |
| NN-2-layer-droput-input-layer_lr01 | 2.624 | 0.009 | -3.552 | 0.000 | 0.000 | 1.000 | 1.031 | 0.304 |
| NN-2-layer-droput-input-layer_lr1 | 1.602 | 0.111 | -4.663 | 0.000 | -1.031 | 0.304 | 0.000 | 1.000 |
| NN-4-layer-droput-each-layer_lr0001 | 6.609 | 0.000 | 0.354 | 0.724 | 3.846 | 0.000 | 4.944 | 0.000 |
| NN-4-layer-droput-each-layer_lr01 | 0.741 | 0.460 | -5.600 | 0.000 | -1.899 | 0.059 | -0.868 | 0.386 |
| NN-4-layer-droput-each-layer_lr1 | 0.954 | 0.341 | -5.430 | 0.000 | -1.708 | 0.089 | -0.668 | 0.505 |
| NN-4-layer_thin_dropout | 6.194 | 0.000 | -0.070 | 0.945 | 3.439 | 0.001 | 4.533 | 0.000 |
| NN-4-layer_thin_dropout_lr01 | 0.950 | 0.343 | -5.247 | 0.000 | -1.640 | 0.102 | -0.627 | 0.531 |
| NN-4-layer_thin_dropout_lr1 | 0.109 | 0.914 | -6.174 | 0.000 | -2.493 | 0.013 | -1.479 | 0.141 |
| NN-4-layer_wide_no_dropout | 7.710 | 0.000 | 1.339 | 0.182 | 4.864 | 0.000 | 5.999 | 0.000 |
| NN-4-layer_wide_no_dropout_lr01 | 1.087 | 0.278 | -5.319 | 0.000 | -1.587 | 0.114 | -0.542 | 0.588 |
| NN-4-layer_wide_no_dropout_lr1 | 0.927 | 0.355 | -5.439 | 0.000 | -1.728 | 0.085 | -0.691 | 0.491 |
| NN-4-layer_wide_with_dropout | 7.664 | 0.000 | 1.281 | 0.201 | 4.815 | 0.000 | 5.951 | 0.000 |
| NN-4-layer_wide_with_dropout_lr01 | 0.528 | 0.598 | -5.874 | 0.000 | -2.130 | 0.034 | -1.093 | 0.275 |
| NN-4-layer_wide_with_dropout_lr1 | 0.740 | 0.460 | -5.643 | 0.000 | -1.916 | 0.057 | -0.879 | 0.380 |
| PassiveAggressiveClassifier | 10.834 | 0.000 | 3.867 | 0.000 | 7.614 | 0.000 | 8.909 | 0.000 |
| RandomForestClassifier | 13.640 | 0.000 | 6.571 | 0.000 | 10.261 | 0.000 | 11.632 | 0.000 |
| SVC | 12.823 | 0.000 | 5.808 | 0.000 | 9.503 | 0.000 | 10.848 | 0.000 |

| Strategy | NN-4-layer-droput-each-layer_lr0001 t_stat | NN-4-layer-droput-each-layer_lr0001 p_val | NN-4-layer-droput-each-layer_lr01 t_stat | NN-4-layer-droput-each-layer_lr01 p_val | NN-4-layer-droput-each-layer_lr1 t_stat | NN-4-layer-droput-each-layer_lr1 p_val | NN-4-layer_thin_dropout t_stat | NN-4-layer_thin_dropout p_val |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BaggingClassifier | 5.912 | 0.000 | 12.754 | 0.000 | 12.640 | 0.000 | 6.405 | 0.000 |
| BaselineClassifier | -8.451 | 0.000 | -2.109 | 0.036 | -2.350 | 0.020 | -8.003 | 0.000 |
| BernoulliNaiveBayes | 1.557 | 0.121 | 8.336 | 0.000 | 8.177 | 0.000 | 2.042 | 0.042 |
| GaussianNaiveBayes | 0.278 | 0.781 | 6.514 | 0.000 | 6.346 | 0.000 | 0.723 | 0.471 |
| GradientBoostingClassifier | 4.541 | 0.000 | 11.115 | 0.000 | 10.985 | 0.000 | 5.013 | 0.000 |
| K_Neighbours | 5.415 | 0.000 | 12.332 | 0.000 | 12.214 | 0.000 | 5.914 | 0.000 |
| NN-12-layer_wide_with_dropout | -4.099 | 0.000 | 1.734 | 0.084 | 1.540 | 0.125 | -3.686 | 0.000 |
| NN-12-layer_wide_with_dropout_lr01 | -6.021 | 0.000 | -0.084 | 0.933 | -0.295 | 0.768 | -5.602 | 0.000 |
| NN-12-layer_wide_with_dropout_lr1 | -6.609 | 0.000 | -0.741 | 0.460 | -0.954 | 0.341 | -6.194 | 0.000 |
| NN-2-layer-droput-input-layer_lr001 | -0.354 | 0.724 | 5.600 | 0.000 | 5.430 | 0.000 | 0.070 | 0.945 |
NN-2-layer-droput-input-layer_lr01 -3.846 0.000