1 Introducing MLaut
MLaut [32] is a modelling and workflow toolbox in Python, written with the aim of simplifying large-scale benchmarking of machine learning strategies, e.g., validation, evaluation and comparison with respect to predictive/task-specific performance or runtime. Key features are:

automation of the most common workflows for benchmarking modelling strategies on multiple datasets, including statistical post-hoc analyses, with user-friendly default settings

unified interface with support for scikit-learn strategies and Keras deep neural network architectures, including easy user extensibility to (partially or completely) custom strategies

higher-level meta-data interface for strategies, allowing easy specification of scikit-learn pipelines and Keras deep network architectures, with user-friendly (sensible) default configurations

easy setting up and loading of data set collections for local use (e.g., data frames from local memory, UCI repository, OpenML, the Delgado study, PMLB)

backend-agnostic, automated local file system management of datasets, fitted models, predictions, and results, with the ability to easily resume crashed benchmark experiments with long running times
MLaut may be obtained from PyPI via pip install mlaut, and is maintained on GitHub at github.com/alan-turing-institute/mlaut. A Docker implementation of the package is available on Docker Hub via docker pull kazakovv/mlaut.
Note of caution: time series and correlated/associated data samples
MLaut implements benchmarking functionality which provides statistical guarantees under the assumption of either independent data samples, independent data sets, or both. This is mirrored in Section 2.3 by the crucial mathematical assumptions of statistical independence (i.i.d. samples), and is further expanded upon in Section 2.4.
In particular, it should be noted that naive application of the validation methodology implemented in MLaut to samples of time series, or to other correlated/associated/non-independent data samples (within or between datasets), will in general violate the validation methodology’s assumptions, and may hence result in misleading or flawed conclusions about algorithmic performance.
The BSD license under which MLaut is distributed further explicitly excludes liability for any damages arising from use, non-use, or misuse of MLaut (e.g., misapplication within, or in evaluation of, a time-series-based trading strategy).
1.1 State of the art: modelling toolbox and workflow design
A hierarchy of modelling designs may tentatively be identified in contemporary machine learning and modelling ecosystems, such as the python data science environment and the R language:

provision of a unified interface for methodology solving the same “task”, e.g., supervised learning aka predictive modelling. This is one core feature of the Weka [29], scikit-learn [35] and Shogun [41] projects, which all also implement level 1 functionality, and the main feature of the caret [47] and mlr [6] packages in R, which provide level 2 functionality by externally interfacing level 1 packages.

composition and meta-learning interfaces, such as tuning and pipeline building; more generally, first-order operations on modelling strategies. Packages implementing level 2 functionality usually (but not always) also implement this, such as the general hyper-parameter tuning and pipeline composition operations found in scikit-learn and mlr or its mlrCPO extension. Keras [12] has abstract level 3 functionality specific to deep learning; Shogun possesses such functionality specific to kernel methods.

workflow automation of higher-order tasks performed with level 3 interfaces, e.g., diagnostics, evaluation and comparison of pipeline strategies. mlr is, to our knowledge, the only existing modelling toolbox with a modular, class-based level 4 design that supports and automates resampling-based model evaluation workflows. The Weka GUI and module design also provide some level 4 functionality.
A different type of level 4 functionality is automated model building, closely linked to, but not identical with, benchmarking and automated evaluation, similarly to how, mathematically, model selection is not identical with model evaluation. Level 4 interfaces for automated model building also tie into level 3 interfaces; examples of automated model building are implemented in Auto-WEKA [24], auto-sklearn [20], or extensions to mlrCPO [44].
In the Python data science environment, to our knowledge, there is currently no widely adopted solution with level 4 functionality for evaluation, comparison, and benchmarking workflows. The reasonably well-known skll [1] package provides automation functionality in Python for scikit-learn based experiments, but follows an unencapsulated scripting design which limits extensibility and usability, especially since it is difficult to use with level 3 functionality from scikit-learn or state-of-the-art deep learning packages.
Prior studies conducting experiments which are level 4 use cases, i.e., large-scale benchmarking experiments of modelling strategies, exist for supervised classification, such as [19, 45]. Smaller studies, focusing on a couple of estimators trained on a small number of datasets, have also been published [28]. However, to the best of our knowledge: none of the authors released a toolbox for carrying out the experiments; code used in these studies cannot be directly applied to conduct other machine learning experiments; and deep neural networks were not included as part of the benchmark exercises.
At the current state of the art, hence, there is a distinct need for level 4 functionality in the scikit-learn and Keras ecosystems. Instead of recreating the mlr interface or following a GUI-based philosophy such as Weka's, we have decided to create a modular workflow environment which builds on the particular strengths of Python as an object-oriented programming language, the notebook-style user interaction philosophy of the Python data science ecosystem, and the contemporary mathematical-statistical state of the art with best practice recommendations for conducting formal benchmarking experiments, while attempting to learn from what we believe works well (or not so well) in mlr and Weka.
1.2 Scientific contributions
MLaut is more than a mere implementation of readily existing scientific ideas or methods. We argue that the following contributions, outlined in the manuscript, are scientific contributions closely linked to its creation:

design of a modular “level 4” software interface which simultaneously supports the predictive model validation/comparison workflow, a data/model file input/output backend, and an abstraction of post-hoc evaluation analyses.

a comprehensive overview of the state of the art in statistical strategy evaluation, comparison and comparative hypothesis testing on a collection of data sets. We further close gaps in said literature by formalizing and explicitly stating the kinds of guarantees the different analyses provide, and by detailing computations of related confidence intervals.

as a principal test case for MLaut, we conducted a large-scale supervised classification study in order to benchmark the performance of a number of machine learning algorithms, with a key sub-question being whether more complex and/or costly algorithms tend to perform better on real-world datasets. On the representative collection of UCI benchmark datasets, kernel methods and random forests perform best.

as a specific but quite important sub-question, we empirically investigated whether common off-the-shelf deep learning strategies would be worth considering as a default choice on the “average” (non-image, non-text) supervised learning dataset. The answer, somewhat surprising in its clarity, appears to be that they are not, in the sense that alternatives usually perform better. However, on the smaller tabular datasets, the computational cost of off-the-shelf deep learning architectures is also not as high as one might naively assume. This finding is also subject to a major caveat and future confirmation, as discussed in Section 5.4.3 and Section 5.6.4.
Literature relevant to these contributions will be discussed in the respective sections.
1.3 Overview: usage and functionality
We present a short written demo of core MLaut functionality and user interaction, designed to be convenient in combination with a Jupyter notebook or scripting command line working style. Introductory Jupyter notebooks similar to the below may be found as part of MLaut’s documentation [32].
The first step is setting up a database for the dataset collection, which has to happen only once per computer and dataset collection; we assume the collection has already been stored in a local MLaut HDF5 database. The first step in the core benchmarking workflow is then to define hooks to the database input and output files.
After the hooks are created, we can proceed to preparing fixed resampling splits (training/test) on which all strategies are evaluated. By default, MLaut creates a single evaluation split, with a uniformly sampled portion of the data used for training and the remainder for testing.
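The split logic can be sketched in plain numpy. Note that `single_split` is a hypothetical helper written for illustration, not part of the MLaut API, and the 2/3 train fraction is an assumed default for the example:

```python
import numpy as np

def single_split(n_samples, train_frac=2/3, seed=0):
    """Uniformly sample a single train/test index split over n_samples points.

    Illustrative sketch of a fixed, uniformly sampled evaluation split;
    the train fraction is an assumption, not MLaut's documented default.
    """
    rng = np.random.RandomState(seed)
    idx = rng.permutation(n_samples)          # uniform random ordering of indices
    n_train = int(round(train_frac * n_samples))
    return idx[:n_train], idx[n_train:]       # disjoint train and test index tuples

train_idx, test_idx = single_split(30)
```

Fixing the split once (here via the seed) is what makes the comparison fair: every strategy is fitted and evaluated on exactly the same partition of each dataset.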
For a simple setup, a standard set of estimators that come with sensible parameter defaults can be initialized. Advanced commands allow the user to specify hyper-parameters, tuning strategies, Keras deep learning architectures, scikit-learn pipelines, or even fully custom estimators.
The user can now proceed to running the experiments. Training, prediction and evaluation are separate; partial results, including fitted models and predictions, are stored and retrieved through the database hooks. This allows intermediate analyses, and allows the experiment to easily resume in case of a crash or interruption. If this happens, the user simply needs to re-run the code above, and the experiment will continue from the last checkpoint, without re-executing prior costly computation.
The last step in the pipeline is executing post-hoc analyses for the benchmarking experiments. The AnalyseResults class allows the user to specify performance quantifiers to be computed and comparison tests to be carried out, based on the intermediate computation data, e.g., predictions from all the strategies.
The prediction_errors() method returns two sets of results: the errors_per_estimator dictionary, which is used subsequently in further statistical tests, and the errors_per_dataset_per_estimator_df dataframe, which holds the loss of each estimator on each dataset and can be examined directly by the user.
We can also use the produced errors in order to perform statistical tests for method comparison, for example a t-test.
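As an illustration of the kind of test meant here (using scipy directly, not MLaut's own code), a paired t-test can be run over hypothetical per-dataset errors of two strategies, of the kind found in the errors_per_estimator dictionary:

```python
import numpy as np
from scipy import stats

# Hypothetical per-dataset mean losses of two strategies, one entry per dataset
# (placeholder numbers for illustration only).
errors_a = np.array([0.21, 0.18, 0.25, 0.22, 0.30, 0.19])
errors_b = np.array([0.24, 0.20, 0.27, 0.21, 0.33, 0.22])

# Paired t-test across datasets: is the mean difference in losses zero?
t_stat, p_value = stats.ttest_rel(errors_a, errors_b)
```

The pairing by dataset matters: the two strategies are evaluated on the same datasets, so the per-dataset differences, not the pooled samples, carry the comparison.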
Data frames or graphs resulting from the analyses can then be exported, e.g., for presentation in a scientific report.
Authors’ contributions
MLaut is part of VK’s PhD thesis project, the original idea being suggested by FK. MLaut and this manuscript were created by VK, under supervision by FK. The design of MLaut is by VK, with suggestions by FK. Sections 1, 2 and 3 were substantially edited by FK before publication, other sections received only minor edits (regarding content). The benchmark study of supervised machine learning strategies was conducted by VK.
Acknowledgments
We thank Bilal Mateen for critical reading of our manuscript, and especially for suggestions of how to improve readability of Section 2.4.
FK acknowledges support by The Alan Turing Institute under EPSRC grant EP/N510129/1.
2 Benchmarking supervised learning strategies on multiple datasets: generative setting
This section introduces the mathematical-statistical setting for the mlaut toolbox: supervised learning on multiple datasets. Once the setting is introduced, we are able to describe the suite of statistical benchmark post-hoc analyses that mlaut implements, in Section 3.
2.1 Informal workflow description
Informally, and non-quantitatively, the workflow implemented by mlaut is as follows: multiple prediction strategies are applied to multiple datasets, where each strategy is fitted to a training set and queried for predictions on a test set. From the test set predictions, performances are computed: performances by dataset, and also overall performances across all datasets, with suitable confidence intervals. For performance across all datasets, quantifiers of comparison (“is method A better than method B overall?”) are computed, in the form of statistical (frequentist) hypothesis tests, where p-values and effect sizes are reported.
The remainder of this Section 2 introduces the generative setting, i.e., the statistical-mathematical formalism for the data sets and future situations for which performance guarantees are to be obtained. The reporting and quantification methodology implemented in the mlaut package is described in mathematical language in Section 3; usage and implementation of these in the mlaut package are described in Section 4.
From a statistical perspective, it should be noted that only a single train/test split is performed for validation. This is partly due to simplicity of implementation, and partly due to the state of the art’s incomplete understanding of how to obtain confidence intervals or variances for resampled performance estimates. Cross-validation strategies may be supported in future versions.
A reader may also wonder whether, even if there is only a single set of folds, there should not be three folds per split (or two nested splits), into tuning-train/tuning-test/test (what we call the “tuning-test fold” is often, somewhat misleadingly, called a “validation fold”; we believe the latter terminology is misleading, since it is actually the final test fold which validates the strategy, not the second fold). The answer is: yes, if tuning via a resample split of the training set is performed. However, in line with current state-of-the-art understanding and interface design, tuning is considered part of the prediction strategy. That is, the tuning-train/tuning-test split is strategy-intrinsic. Only the train/test split is extrinsic and part of the evaluation workflow which mlaut implements; a potential tuning split is encapsulated in the strategy. This corresponds to state-of-the-art usage and understanding of the wrapper/composition formalism, as implemented for example by GridSearchCV in sklearn.
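This encapsulation can be sketched directly with scikit-learn (an illustration of the wrapper formalism, not MLaut code; the dataset and parameter grid are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Extrinsic train/test split: part of the evaluation workflow.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The tuning-train/tuning-test split is intrinsic to the strategy:
# GridSearchCV resamples X_train internally, so the evaluation workflow
# only ever sees one opaque fitted "strategy".
strategy = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5]}, cv=3)
strategy.fit(X_train, y_train)

# The extrinsic test set is touched exactly once, for evaluation.
accuracy = strategy.score(X_test, y_test)
```

From the evaluation workflow's point of view, the tuned composite behaves like any other estimator with fit and predict, which is exactly why tuning can be treated as strategy-intrinsic.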
2.2 Notational and mathematical conventions
To avoid confusion between quantities which are random and non-random, we always explicitly say if a quantity is a random variable. Furthermore, instead of declaring the type of a random variable, say $X$, by writing it out as a measurable function $X\colon \Omega \to \mathcal{X}$, we say “$X$ is a random variable taking values in $\mathcal{X}$”, or abbreviated “$X$ t.v.in $\mathcal{X}$”, suppressing mention of the probability space $\Omega$ which we assume to be the same for all random variables appearing.
This allows us to talk easily about random variables taking values in certain sets of functions, for example a prediction functional obtained from fitting to a training set. Formally, we will denote the set of functions from a set $\mathcal{X}$ to a set $\mathcal{Y}$ by the type-theoretic arrow symbol $\mathcal{X} \to \mathcal{Y}$, where bracketing as in $[\mathcal{X} \to \mathcal{Y}]$ may be added for clarity and disambiguation. E.g., to clarify that we consider a function-valued random variable $f$, we will say for example “let $f$ be a random variable t.v.in $[\mathcal{X} \to \mathcal{Y}]$”.
An observant reader familiar with measure theory will notice a potential issue (others may want to skip to the next subsection): the set $[\mathcal{X} \to \mathcal{Y}]$ is, in general, not endowed with a canonical measure. This is remedied as follows: if we talk about a random variable $f$ taking values in $[\mathcal{X} \to \mathcal{Y}]$, it is assumed that the image of the corresponding measurable function $f\colon \Omega \to [\mathcal{X} \to \mathcal{Y}]$, which may not be all of $[\mathcal{X} \to \mathcal{Y}]$, is a measurable space. This is, for example, the case when we substitute training data random variables into a deterministic training functional, which canonically endows the image with the substitution push-forward measure.
2.3 Setting: supervised learning on multiple datasets
We introduce mathematical notation to describe datasets and prediction strategies. As running indices, we will consistently use $i$ for the dataset, $j$ for the (training or test) data point in a given data set, and $k$ for the estimator.
The data in the $i$th dataset are assumed to be sampled from mutually independent, generative/population random variables $(X_i, Y_i)$, taking values in feature-label-pairs $\mathcal{X}_i \times \mathcal{Y}$, where either $\mathcal{Y} = \mathbb{R}$ (regression) or $\mathcal{Y}$ is finite (classification). In particular, we assume that the label type is the same in all datasets.
The actual data are i.i.d. samples from the population $(X_i, Y_i)$, which for notational convenience we assume to be split into a training set $\mathcal{D}_i = \big((X_{i1}, Y_{i1}), \dots, (X_{iN_i}, Y_{iN_i})\big)$ and a test set $\mathcal{T}_i = \big((X^*_{i1}, Y^*_{i1}), \dots, (X^*_{iM_i}, Y^*_{iM_i})\big)$. Note that the training and test set in the $i$th dataset are, formally, not “sets” (as in common diction) but ordered tuples, of length $N_i$ and $M_i$. This is for notational convenience, as it allows easy reference to single data points. By further convention, we will write $Y^*_i := (Y^*_{i1}, \dots, Y^*_{iM_i})$ for the ordered tuple of test labels.
On each of the $N$ datasets, $K$ different prediction strategies are fitted to the training set: these are formalized as random prediction functionals $f_{ik}$ t.v.in $[\mathcal{X}_i \to \mathcal{Y}]$, where $i \in \{1, \dots, N\}$ and $k \in \{1, \dots, K\}$. We interpret $f_{ik}$ as the fitted prediction functional obtained from applying the $k$th prediction strategy on the $i$th dataset, where it is fitted to the training set $\mathcal{D}_i$.
Statistically, we make mathematical assumptions to mirror the reasonable intuitive assumptions that there is no active information exchange between different strategies, or between copies of a given strategy applied to different data sets: we assume that the random variable $f_{ik}$ may depend on the training set $\mathcal{D}_i$, but is independent of all other data, i.e., of the test set of the $i$th dataset, and of the training and test sets of all the other datasets. It is further assumed that $f_{ik}$ is independent of all other fitted functionals $f_{i'k'}$ where $i' \neq i$ and $k'$ is entirely arbitrary. It is also assumed that $f_{ik}$ is conditionally independent of all $f_{ik'}$ where $k' \neq k$, given $\mathcal{D}_i$.
We further introduce notation for predictions, i.e., $\widehat{Y}^*_{ijk} := f_{ik}(X^*_{ij})$; that is, $\widehat{Y}^*_{ijk}$ is the prediction made by the fitted prediction functional $f_{ik}$ for the actually observed test label $Y^*_{ij}$.
For convenience, the same notation is introduced for the generative random variables, i.e., $\widehat{Y}_{ik} := f_{ik}(X_i)$. Similarly, we denote by $\widehat{Y}^*_{ik} := (\widehat{Y}^*_{i1k}, \dots, \widehat{Y}^*_{iM_ik})$ the random vectors of length $M_i$ whose entries are predictions for the full test sample, made by method $k$.
2.4 Performance: which performance?
Benchmarking experiments produce performance and comparison quantifiers for the competitor methods. It is important to recognise that these quantifiers are computed to create guarantees for the methods’ use on putative future data. These guarantees are obtained based on mathematical theorems, such as the central limit theorem, applicable under empirically justified assumptions. It is crucial to note that mathematical theorems allow establishing performance guarantees on future data, despite the future data not being available to the experimenter at all. It is also important to note that the future data for which the guarantees are created are different from, and in general not identical to, the test data.
Contrary to occasional belief, performance on the test data in isolation is empirically not useful: without a guarantee it is unrelated to the argument of algorithmic effectivity the experimenter wishes to make.
While a full argument usually does involve computing performance on a statistically independent test set, the argumentative reason for this best practice is more subtle than being of interest by itself. It is a consequence of “prediction” performance on the training data not being a fair proxy for performance on future data. Instead, “prediction” on an unseen (statistically independent) test set is a fair(er) proxy, as it allows for the formation of performance guarantees on future data: the test set being unseen allows one to leverage central limit theorems for this purpose.
In benchmark evaluation, it is hence crucial to make precise the relation between the testing setting and the application case on future data. There are two key types of distinctions regarding the future data application case:

whether, in the scenario, a fitted prediction functional is to be reused, or whether it is refitted on new data (potentially from a new data source).

whether, in the scenario, the data source is identical with the source of one of the observed datasets, or whether it is merely a source from the same population as the observed data sources.
Being precise about these distinctions is, in fact, practically crucial: similar to the best practice of not testing on the training set, one needs to be careful about whether a data source, or a fitted strategy, that will occur in the future test case has already been observed in the benchmarking experiment, or not.
We make the above mathematically precise (a reader interested only in an informal explanation may prefer to skip forward to the subsequent paragraph).
To formalize “reuse”, distinction (i) translates to conditioning on the fitted prediction functionals $f_{ik}$, or not. Conditioning corresponds to prior observation, hence to having observed the outcome of the fitting process, and therefore to “reusing” $f_{ik}$. Not doing so corresponds to sampling again from the random variable, hence to “refitting”.
To formalize the “data source” distinction, we will assume an i.i.d. process $T_1, T_2, \dots$, taking values in joint distributions over feature-label-pairs (thus, distributions also selected at random), generating the distributions according to which the population laws are distributed; i.e., $T_1, \dots, T_N$ is an i.i.d. sample. The $i$th element of this sample, $T_i$, is the (generating) data source for the $i$th data set, i.e., $(X_i, Y_i) \sim T_i$. We stress that $T_i$ takes values in distributions, i.e., $T_i$ is a distribution which is itself random (thus, the symbol $T_i$ is used here in its common “distribution” meaning, not in the “distribution of a random variable” meaning, two meanings usually confounded by abuse of notation), and from which data are generated. In this mathematical setting, distinction (ii) then states whether the guarantee applies to data sampled from $T_i$ for a specific, already observed $i$, or instead to data sampled from a newly drawn $T_{N+1}$. The former is “data from the already observed $i$th source”, the latter is “data from a source similar to, but not identical to, the observed sources”. If the latter is the case, the same generative principle is applied to yield a prediction functional $f_{N+1,k}$, drawn i.i.d. from the hypothetical generating process which yielded the $f_{ik}$ on the $i$th dataset; we remain notationally consistent by defining $(X_{N+1}, Y_{N+1}) \sim T_{N+1}$.
For intuitive clarity, let us consider an example
: three supervised classification methods, a random forest, logistic regression, and the baseline “predicting the majority class”, are benchmarked on 50 datasets, from 50 hospitals, one dataset corresponding to observations in exactly one hospital. Every dataset is a sample of patients (data frame rows) for which, as variables (data frame columns), the outcome (= prediction target) therapy success yes/no for a certain disease is recorded, plus a variety of demographic and clinical variables, where what is recorded differs by hospital.
A benchmarking experiment may be asked to produce a performance quantifier for one of the following three distinct key future data scenarios:

reusing the trained classifiers (e.g., random forest), trained on the training data of hospital 42, to make predictions on future data observed in hospital 42.

(re)fitting a given classifier (e.g., random forest) to new data from hospital 42, to make predictions on further future data observed in hospital 42.

obtaining future data from a new hospital 51, fitting the classifiers to that data, and using the so fitted classifiers to make predictions on further future data observed from hospital 51.
It is crucial to note that both performances and guarantees may (and in general will) differ between these three scenarios. In hospital 42, a random forest may outperform logistic regression and the baseline, while in hospital 43 nothing outperforms the baseline. The behaviour and ranking of strategies may also be different, depending on whether classifiers are reused or refitted. This may happen in the same hospital, or when done in an average unseen hospital. Furthermore, the same qualitative differences as for observed performances may hold for the precision of the statistical guarantees obtained from performances in a benchmarking experiment: the sample size of patients in a given hospital may be large enough or too small to observe a significant difference of performances in a given hospital, while the sample size of hospitals is the key determinant of how reliable statistical guarantees about performances and performance differences for unseen hospitals are.
In what follows, we introduce abbreviating terminology for the distinctions above: for (i), we will talk about a reused (after training once) versus a retrained (on new data) prediction algorithm. For (ii), we will talk about seen and unseen data sources. Further, we will refer to the three future data scenarios by the letters (a), (b), and (c). In this terminology, the algorithm is: (a) reused on seen sources, (b) retrained on seen sources, and (c) retrained on an unseen source (similar to, but not identical to, the seen sources).
It should be noted that it is impossible to reuse an algorithm on an unseen source, by definition of the word “unseen”, hence the hypothetical fourth combination of the two dichotomies reused/retrained and unseen/seen is logically impossible.
2.5 Performance quantification
Performance of the prediction strategy is measured by a variety of quantifiers which compare predictions for the test set with actual observations from the test set, the “ground truth”. Three types of quantifiers are common:

Average loss based performance quantifiers, obtained from comparing one method’s predictions and the ground truth observations one by one. An example is the mean squared error on the test set, which is the average squared loss.

Aggregate performance quantifiers, obtained from a comparison of all of a given method’s predictions with all of the ground truth observations. Examples are sensitivity or specificity.

Ranking based performance quantifiers, obtained from relative performance ranks of multiple methods, from a ranked comparison against each other. These are usually leveraged for comparative hypothesis tests, and may or may not involve computation of ranks based on average or aggregate performances as in (i) and (ii). An example is the Friedman rank test to compare multiple strategies.
The three kinds of performance quantifiers are discussed in more detail below.
2.5.1 Average based performance quantification
For this, the most widely used method is a loss (or score) function $L\colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, which compares a single prediction (by convention the first argument) with a single observation (by convention the second argument).
Common examples for such loss/quantifier functions are listed below in Table 1.


task | name | loss/quantifier function
classification (det.) | MMCE | $L(\widehat{y}, y) = \mathbb{1}[\widehat{y} \neq y]$
regression | squared loss | $L(\widehat{y}, y) = (\widehat{y} - y)^2$
regression | absolute loss | $L(\widehat{y}, y) = |\widehat{y} - y|$
regression | Q-loss (at quantile $\alpha$) | $L(\widehat{y}, y) = \alpha\, m(y, \widehat{y}) + (1 - \alpha)\, m(\widehat{y}, y)$, where $m(a, b) = \max(a - b, 0)$

Table 1: List of some popular loss functions to measure prediction goodness (third column) used in the most frequent supervised prediction scenarios (first column). Above, $\widehat{y}$ and $y$ are elements of $\mathcal{Y}$. For classification, $\mathcal{Y}$ is discrete; for regression, $\mathcal{Y} = \mathbb{R}$. The symbol $\mathbb{1}[B]$ evaluates to $1$ if the boolean expression $B$ is true, otherwise to $0$.

In direct alignment with the different future data scenarios discussed in Section 2.4, the distributions of three generative random variables are of interest:
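For concreteness, the losses of Table 1 can be written down in a few lines of numpy; the pinball form of the Q-loss below follows one common convention and should be read as an illustrative assumption where the extracted formula was ambiguous:

```python
import numpy as np

def mmce_loss(y_pred, y_true):
    """Misclassification loss 1[y_hat != y], element-wise."""
    return (np.asarray(y_pred) != np.asarray(y_true)).astype(float)

def squared_loss(y_pred, y_true):
    return (np.asarray(y_pred) - np.asarray(y_true)) ** 2

def absolute_loss(y_pred, y_true):
    return np.abs(np.asarray(y_pred) - np.asarray(y_true))

def q_loss(y_pred, y_true, alpha):
    """Quantile (pinball) loss: alpha * max(y - y_hat, 0) + (1 - alpha) * max(y_hat - y, 0)."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return alpha * np.maximum(y_true - y_pred, 0) + (1 - alpha) * np.maximum(y_pred - y_true, 0)
```

Averaging any of these element-wise losses over a test sample yields the corresponding average based performance quantifier, e.g., the mean of the squared loss is the familiar mean squared error.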

The conditional random variable $L(\widehat{Y}_{ik}, Y_i)\,|\,f_{ik}$, the loss when predicting on future data from the $i$th data source, when reusing the already trained prediction functional $f_{ik}$. Note that, formally, through conditioning, $f_{ik}$ is implicitly considered constant (not random), which therefore reflects reuse of an already trained functional.

The random variable $L(\widehat{Y}_{ik}, Y_i)$, the loss when retraining method $k$ on training data from the $i$th data source, and predicting labels on future data from the $i$th data source. Without conditioning, no reuse occurs, and this random variable reflects repeating the whole random experiment, including retraining of $f_{ik}$.

The random variable $L(\widehat{Y}_{N+1,k}, Y_{N+1})$, the loss when training method $k$ on a completely new data source, and predicting labels on future data from the same source as that dataset.
The distributions of the above random variables are generative, hence unknown. In practice, the validation workflow estimates summary statistics of these. Of particular interest in the mlaut workflow are the related expectations, i.e., (arithmetic) population average errors. We list them below, suppressing the notational dependency on $L$ for ease of notation:

$\varepsilon_{ik} := \mathbb{E}[L(\widehat{Y}_{ik}, Y_i)\,|\,f_{ik}]$, the (training set) conditional expected generalization error of (a reused) $f_{ik}$, on data source $i$.

$\bar{\varepsilon}_k := \frac{1}{N} \sum_{i=1}^{N} \varepsilon_{ik}$, the conditional expected generalization error of the (reused) $k$th strategy, averaged over all seen data sources.

$\epsilon_{ik} := \mathbb{E}[L(\widehat{Y}_{ik}, Y_i)]$, the unconditional expected generalization error of (a retrained) $f_{ik}$, on data source $i$.

$\epsilon_k := \mathbb{E}[L(\widehat{Y}_{N+1,k}, Y_{N+1})]$, the expected generalization error on a typical (unseen) data source.
It should be noted that $\varepsilon_{ik}$ and $\bar{\varepsilon}_k$ are random quantities, but conditionally constant once the respective $f_{ik}$ are known (e.g., once they have been trained). It further holds that $\epsilon_{ik} = \mathbb{E}[\varepsilon_{ik}]$.
The mlaut toolbox currently implements estimators for only two of the above three future data situations, namely for situations (a: reused, seen) and (c: retrained, unseen), i.e., estimators for all quantities with the exception of $\epsilon_{ik}$. The reason for this is that, for situation (b: retrained, seen), at the current state of the literature it appears unclear how to obtain good estimates, that is, estimates with provably favourable statistical properties independent of the data distribution or the algorithmic strategy. For situations (a) and (c), classical statistical theory may be leveraged, e.g., mean estimation and frequentist hypothesis testing.
It should also be noted that $\epsilon_{ik}$ is a single dataset performance quantifier rather than a benchmark performance quantifier, and therefore outside the scope of mlaut’s core use case. While $\varepsilon_{ik}$ is also a single dataset quantifier, it is easy to estimate en passant while estimating the benchmark quantifier $\bar{\varepsilon}_k$, hence it is included in the discussion as well as in mlaut’s functionality.
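The classical theory invoked for situations (a) and (c) amounts to estimating a mean from i.i.d. losses, with a CLT-based confidence interval. A minimal numpy sketch of this (an illustration, not MLaut's implementation; `estimate_eps` is a hypothetical helper):

```python
import numpy as np

def estimate_eps(losses, z=1.96):
    """Point estimate and approximate 95% CI for an expected loss,
    from i.i.d. per-sample (or per-dataset) losses, via the CLT."""
    losses = np.asarray(losses, dtype=float)
    n = losses.size
    mean = losses.mean()                       # sample mean estimates the expectation
    sem = losses.std(ddof=1) / np.sqrt(n)      # standard error of the mean
    return mean, (mean - z * sem, mean + z * sem)

mean, (lo, hi) = estimate_eps([0.2, 0.3, 0.25, 0.22, 0.28])
```

Fed per-sample test losses, this targets the single-source quantities; fed per-dataset average losses, it targets the unseen-source quantity, where the number of datasets governs the interval width.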
2.5.2 Aggregate based performance quantification
A somewhat less frequently used alternative are aggregate loss/score functions $L\colon \mathcal{Y}^{(m)} \times \mathcal{Y}^{(m)} \to \mathbb{R}$, which compare a tuple of predictions with a tuple of observations, in a way that is not expressible as a mean loss such as in Section 2.5.1. Here, by slight abuse of notation, $\mathcal{Y}^{(m)}$ denotes tuples over $\mathcal{Y}$, of fixed length $m$. The use of the symbol $L$ is discordant with the previous section, and assumes a case distinction on whether an average or an aggregate is used.
The most common uses of aggregate performance quantifiers are found in deterministic binary classification, as entries of (or quantities derived from) the classification contingency table. These, and further common examples, are listed below in Table 2.


task | name | loss/quantifier function
classification (det., binary) | sensitivity, recall | $\mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$
classification (det., binary) | specificity | $\mathrm{TN} / (\mathrm{TN} + \mathrm{FP})$
classification (det., binary) | precision, PPV | $\mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$
classification (det., binary) | F1 score | $2\,\mathrm{TP} / (2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN})$
regression | root mean squared error | $\sqrt{\tfrac{1}{m} \sum_{j=1}^{m} (\widehat{y}_j - y_j)^2}$

Table 2: List of some popular aggregate performance quantifiers (third column) used in the most frequent supervised prediction scenarios (first column). Above, $\mathrm{TP}$, $\mathrm{TN}$, $\mathrm{FP}$ and $\mathrm{FN}$ denote the counts of true positives, true negatives, false positives and false negatives in the comparison of the prediction tuple $(\widehat{y}_1, \dots, \widehat{y}_m)$ with the observation tuple $(y_1, \dots, y_m)$.
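The contingency-table quantifiers of Table 2 can be computed jointly from a prediction tuple and an observation tuple; a small numpy sketch (the helper `binary_aggregates` is hypothetical, for illustration):

```python
import numpy as np

def binary_aggregates(y_pred, y_true):
    """Contingency-table based aggregate quantifiers for binary labels in {0, 1}."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
    tn = np.sum((y_pred == 0) & (y_true == 0))   # true negatives
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {"sensitivity": sens, "specificity": spec, "precision": prec, "F1": f1}

scores = binary_aggregates([1, 1, 0, 0, 0, 1], [1, 1, 1, 0, 0, 0])
```

Unlike the losses of Section 2.5.1, quantities such as precision or F1 are ratios of counts over the whole tuple, which is exactly why they cannot be written as a per-sample mean loss.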

As before, for the different future data scenarios in Section 2.4, the distributions of three types of generative random variables are of interest. The main complication is that aggregate performance metrics take multiple test points and predictions as input; hence, to specify a population performance, one must specify a test set size. In what follows, we fix a specific test set size, $M_i$, for the $i$th dataset. Recall the notation $Y^*_i$ for the full vector of test labels on data set $i$. In analogy, we denote by $\widehat{Y}^*_{ik}$ the random vector of length $M_i$ whose entries are predictions for the full test sample, made by method $k$, i.e., having as $j$th entry the prediction $\widehat{Y}^*_{ijk}$, as introduced in Section 2.3. Similarly, we denote by $Y^*_{N+1}$ and $\widehat{Y}^*_{N+1,k}$ vectors whose entries are i.i.d. from the data generating distribution of the new data source, both of length $M_{N+1}$, which by assumption follows the same sampling distribution as the $M_i$.
The population performance quantities of interest can be formulated in terms of the above:

$\eta_{ik} := \mathbb{E}[L(\widehat{Y}^*_{ik}, Y^*_i)\,|\,f_{ik}]$, the (training set) conditional expected generalization error of (a reused) $f_{ik}$, on data source $i$.

$\bar{\eta}_k := \frac{1}{N} \sum_{i=1}^{N} \eta_{ik}$, the conditional expected generalization error of the (reused) $k$th strategy, averaged over all seen data sources.

, the unconditional expected generalization error of (a retrained) , on data source .

, the expected generalization error on a typical (unseen) data source.
As before, the future data situations are (a: reused algorithm, seen sources), (b: retrained, seen), and (c: retrained, unseen). In the general setting, the expectations in (a) and (b) may or may not converge to sensible values as the test set size approaches infinity, depending on properties of the aggregate. General methods of estimating these depend on the availability of test data, which, due to the complexities arising and the currently limited state of the art, is outside the scope of mlaut. This unfortunately leaves these benchmarking quantities outside the scope for aggregate performance quantifiers. For (c), classical estimation theory of the mean applies.
2.5.3 Ranking-based performance quantification
Ranking-based approaches consider, on each dataset, a performance ranking of the competitor strategies with respect to a chosen raw performance statistic, e.g., an average or an aggregate performance such as RMSE or F1 score. Performance assessment is then based on the rankings; in the case of ranking, this is most often a comparison, usually in the form of a frequentist hypothesis test. Due to the dependence of the ranking on a raw performance statistic, it should always be understood that ranking-based comparisons are with respect to the chosen raw performance statistic, and may yield different results for different raw performance statistics.
Mathematically, we introduce the population performances in question: the raw statistic is as defined in Section 2.5.1 in the case of an average, and as in Section 2.5.2 in the case of an aggregate. The distribution of this statistic models the generalization performance of a given strategy on a given dataset.
We further define rankings as the order rank of a strategy's performance within the tuple of all strategies' performances on a given dataset.
Of common interest in performance quantification and benchmark comparison are the average ranks, i.e., ranks of a strategy averaged over datasets. The population quantity of interest is the expected average rank on a typical dataset, where the population rank variable corresponds to the sample rank variables above. It should be noted that the average rank depends not only on what a given strategy is or does, but also on the presence of the other strategies in the benchmarking study; hence it is not an absolute performance quantifier for a single method, but a relative quantifier, to be seen in the context of the competitor field.
Common benchmarking methodology of the ranking kind quantifies relative performance on the data sets observed in the sense of future data scenario (b) or (c), where the performance is considered including (re)fitting of the strategies.
3 Benchmarking supervised learning strategies on multiple datasets: methods
We now describe the suite of performance and comparison quantification methods implemented in the mlaut package. It consists largely of state-of-the-art model comparison strategies for the multiple-datasets situation, supplemented by our own constructions based on standard statistical estimation theory where appropriate. References and prior work are discussed in the respective subsections. mlaut supports the following types of benchmark quantification methodology and post-hoc analyses:

loss-based performance quantifiers, such as mean squared error and mean absolute error, including confidence intervals.

aggregate performance quantifiers, such as contingency table quantities (sensitivity, specificity) in classification, including confidence intervals.

rank-based performance quantifiers, such as average performance rank.

comparative hypothesis tests, for relative performance of methods against each other.
The exposition uses notation and terminology introduced in Section 2. Different kinds of quantifiers (loss- and/or rank-based), and different kinds of future performance guarantees (trained vs refitted prediction functional; seen vs unseen sources), as discussed in Section 2.4, may apply across all types of benchmarking analyses.
Which of these is the case, and especially under which future data scenario the given guarantee is supposed to hold, will be stated explicitly for each, and should be taken into account by any use of the respective quantities in scientific argumentation.
Practically, our recommendation is to consider for which of the future data scenarios (a), (b), (c) a guarantee is sought, and whether evidencing differences in rank or differences in absolute performances is of interest.
3.1 Average-based performance quantifiers and confidence intervals
For average-based performance quantifiers, performances and their confidence intervals are estimated from the sample of loss/score evaluates. We denote the elements in this sample using the notation of Section 2.5.1. Note that, differently from the population quantities, there are three (not two) indices: one for the strategy, one for the dataset, and one for the test set point under consideration.


estimate | estimates | f.d.s. | standard error estimate | CLT in
per-dataset mean loss | conditional expected generalization error on a seen source | (a) | per-dataset sample standard error | test set size
average of per-dataset mean losses | conditional expected generalization error, averaged over seen sources | (a) | pooled per-dataset standard errors | total number of test points
average of per-dataset mean losses | expected generalization error on an unseen source | (c) | across-dataset sample standard error | number of datasets

Table 3 presents a number of expected loss estimates with proposed standard error estimates. As all estimates are mean estimates of independent (or conditionally independent) quantities, normal-approximated, two-sided confidence intervals may be obtained for any of the quantities in the standard way, i.e., at confidence $1-\alpha$ as the interval
$[\hat{\mu} - z_{1-\alpha/2}\,\hat{\sigma},\; \hat{\mu} + z_{1-\alpha/2}\,\hat{\sigma}]$,
where $\hat{\mu}$ is the respective (mean) estimate, $\hat{\sigma}$ is the corresponding standard error estimate, and $z_q$ denotes the $q$-quantile of the standard normal distribution.
Note that different estimates and confidence intervals arise through the different future data scenarios that the guarantee is meant to cover; see Sections 2.5.1 and 2.4 for a detailed explanation of how precisely the future data scenarios differ in terms of refitting/reusing the prediction functional, and obtaining performance guarantees for predictive use on an unseen/seen data source. In particular, choosing a different future data scenario may affect the confidence intervals even though the midpoint estimate is the same: the midpoint estimates for scenario (a) averaged over seen sources and for scenario (c) coincide, but the confidence intervals for future data scenario (c), i.e., a new data source and a refitted strategy, are usually wider than those for future data scenario (a), i.e., an already seen data source and no refitting of the strategy.
Technically, all expected loss estimates proposed in Table 3 are (conditional) mean estimates. Two of the confidence intervals are obtained as standard confidence intervals for a (conditionally) independent sample mean: one estimate is the mean of an independent sample (varying over datasets), the other the mean of a conditionally independent sample (varying over test points, and conditioned on the training data). The remaining confidence interval is obtained by averaging the estimated variances of independent summands, which corresponds to the plug-in estimate obtained from the corresponding variance decomposition (all variances conditional on the training data).
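The normal-approximated confidence interval above can be sketched in a few lines of plain Python. This is generic estimation code, not MLaut's API; the function name is our own:

```python
import math
from statistics import mean, stdev

def mean_loss_ci(losses, z=1.96):
    """Mean of a sample of loss evaluations together with a two-sided,
    normal-approximated confidence interval (default z = 1.96, i.e. ~95%)."""
    mu = mean(losses)
    se = stdev(losses) / math.sqrt(len(losses))  # standard error of the mean
    return mu, (mu - z * se, mu + z * se)

# per-test-point losses of one strategy on one dataset (toy numbers)
mu, (lower, upper) = mean_loss_ci([0.2, 0.1, 0.4, 0.3, 0.2, 0.15])
```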
3.2 Aggregate-based performance quantifiers and confidence intervals
For aggregate-based performance quantifiers, performances and their confidence intervals are estimated from the sample of loss/score evaluates. We denote the elements in this sample using the notation of Section 2.5.2. We note that, unlike in the case of average-based evaluation, there is no running index for the test set data point, only indices for the data set and for the prediction strategy.


estimate | estimates | f.d.s. | standard error estimate | CLT in
average of per-dataset aggregate performances | expected aggregate performance on an unseen source | (c) | across-dataset sample standard error | number of datasets

Table 4 presents one expected loss estimate with a proposed standard error estimate, for future data situation (c), i.e., generalization of performance to a new dataset. Even though there is only a single estimate, we present it in a table for concordance with Table 3. A two-sided confidence interval at confidence $1-\alpha$ is obtained exactly as in Section 3.1.
The mean and variance estimates are obtained from standard theory of mean estimation, by the same principle as for average-based estimates. Estimates for situation (a) may be naively constructed from multiple test sets of the same size, or obtained from further distributional assumptions via resampling, though we abstain from developing such an estimate as it does not seem to be common, or available, in the current state of the art.
3.3 Rank-based performance quantifiers
mlaut has functionality to compute rankings based on any average or aggregate performance statistic. That is, for any such choice, the following may be computed.
As in Section 2.5.3, the raw statistic is defined as in Section 2.5.1 in the case of an average, and as in Section 2.5.2 in case of an aggregate. The rank of a strategy is its order rank within the tuple of all strategies' performances.


estimate | estimates | f.d.s. | standard error estimate | CLT in
average rank across datasets | expected average rank on a typical dataset | (c) | across-dataset sample standard error of ranks | number of datasets
average rank difference | expected rank difference on a typical dataset | (c) | Neményi critical difference | number of datasets

Table 5 presents an average rank estimate and an average rank difference estimate, for future data situation (c), i.e., generalization of performance to a new dataset.
The average rank estimate and its standard error are based on the central limit theorem in the number of data sets. The average rank difference estimate is Neményi's critical difference, as referred to in [16], which is used in visualizations.
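Average ranks across datasets can be computed with plain Python. The sketch below uses our own names, not MLaut's interface; it ranks strategies per dataset by a loss where lower is better, then averages:

```python
def average_ranks(loss_table):
    """loss_table[j][i] is the loss of strategy i on dataset j (lower is
    better). Returns the average rank of each strategy across datasets
    (rank 1 = best). For brevity, ties are not given midpoint ranks here."""
    n = len(loss_table[0])           # number of strategies
    totals = [0.0] * n
    for row in loss_table:
        # order strategies from best (lowest loss) to worst on this dataset
        order = sorted(range(n), key=lambda i: row[i])
        for rank, i in enumerate(order, start=1):
            totals[i] += rank
    return [t / len(loss_table) for t in totals]

# three strategies evaluated on three datasets (toy numbers)
ranks = average_ranks([[0.10, 0.20, 0.30],
                       [0.15, 0.10, 0.40],
                       [0.05, 0.20, 0.10]])
```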
3.4 Statistical tests for method comparison
While the methods in the previous sections compute performances with confidence bands, they do not by themselves allow one to compare methods in the sense of ruling out that differences are due to randomness (with the usual statistical caveat that this can never be ruled out entirely, but the plausibility can be quantified).
mlaut implements significance tests for two classes of comparisons: absolute performance differences, and average rank differences, in future data scenario (c), i.e., with a guarantee for the case where the strategy is refitted to a new data source.
mlaut's selection follows closely, and our exposition below follows loosely, the work of [16]. While the latter is mainly concerned with classifier comparison, there is no restriction in principle to leverage the same testing procedures for quantitative comparison with respect to arbitrary (average or aggregate) raw performance quantifiers.
3.4.1 Performance difference quantification
The first class of tests we consider quantifies, for a choice of aggregate or average loss, the significance of average differences in expected generalization performance between two strategies. The meanings of "average" and "significant" may differ, and so does the corresponding effect size; these are made precise below.
All the tests we describe are based on the paired differences of performances, where the pairing considered is the pairing through datasets. That is, on each dataset, the performances of the two strategies under comparison are considered as a pair of performances.
For the paired differences, we introduce abbreviating notation: one symbol if the performance is an average loss/score, another if it is an aggregate loss/score. The non-parametric tests below will also consider the ranks of the paired differences, i.e., the rank of each difference within the sample of all paired differences, taking values between 1 and the number of datasets.
We denote the respective population versions analogously, i.e., the performance difference on a random future dataset, as in scenario (c).


name | tests null | e.s.(raw) | e.s.(norm) | stat.
paired t-test | zero mean of paired performance differences | mean paired difference | Cohen's d for paired samples | t-statistic
Wilcoxon signed-rank test | paired differences distributed symmetrically around zero | signed-rank sum | rank-biserial correlation | signed-rank sum W

Table 6: pairwise comparison tests for benchmark comparison. name = name of the testing procedure. tests null = the null hypothesis that is tested by the testing procedure. e.s.(raw) = the corresponding effect size, in raw units. e.s.(norm) = the corresponding effect size, normalized. stat. = the test statistic which is used in computation of significance. Symbols are defined as in the previous sections.
Table 6 lists a number of common testing procedures. The significances may be seen as guarantees for future data situation (c). The normalized effect size for the paired t-test comparing the performance of two strategies, the quantity in Table 6, is called Cohen's d-statistic for paired samples (to avoid confusion in comparison with the literature, it should be noted that Cohen's d-statistic also exists for unpaired versions of the t-test, which we do not consider here in the context of performance comparison). The normalized effect size for the Wilcoxon signed-rank test is called the biserial rank correlation, or rank-biserial correlation.
It should also be noted that the Wilcoxon signed-rank test, while making use of rank differences, is not a pairwise comparison of strategies' performance ranks; this is a common misunderstanding. While "ranks" appear in both concepts, the ranks in the Wilcoxon signed-rank test are the ranks of the performance differences, pooled across data sets, while in a rank-based performance quantifier, the ranking of different methods' performances (not differences) within a data set (not across data sets) is considered.
The above tests are implemented for one-sided and two-sided alternatives. See [37], [16], or [46] for details.
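The two test statistics can be illustrated in plain Python. The sketch below is self-contained and generic, not MLaut's implementation; in practice one would typically use scipy.stats.ttest_rel and scipy.stats.wilcoxon:

```python
import math

def paired_t_statistic(perf_a, perf_b):
    """t-statistic for the paired per-dataset performance differences."""
    d = [a - b for a, b in zip(perf_a, perf_b)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)  # sample variance
    return mean_d / math.sqrt(var_d / n)

def wilcoxon_w(perf_a, perf_b):
    """Wilcoxon signed-rank statistic: drop zero differences, rank the
    absolute differences (midranks for ties), then sum the ranks of the
    positive differences."""
    d = [a - b for a, b in zip(perf_a, perf_b) if a != b]
    by_abs = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    k = 0
    while k < len(d):
        m = k
        while m + 1 < len(d) and abs(d[by_abs[m + 1]]) == abs(d[by_abs[k]]):
            m += 1
        mid = (k + m) / 2 + 1            # midrank for the tied block
        for idx in by_abs[k:m + 1]:
            ranks[idx] = mid
        k = m + 1
    return sum(r for r, x in zip(ranks, d) if x > 0)
```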
Portmanteau tests for the above may be based on parametric ANOVA, though [16] recommends avoiding these due to the empirical asymmetry and nonnormality of loss distributions. Hence for multiple comparisons, mlaut implements Bonferroni and BonferroniHolm significance correction based posthoc testing.
These tests may be applied to the outputs of any of the loss functions described in Section 2.5.1 in order to compare the performance of the prediction functionals.
3.4.2 Performance rank difference quantification
Performance rank-based testing uses the observed performance ranks of each strategy on each data set. These are defined as above in Section 3.3, whose notation we keep, including notation for the average rank estimate. We further introduce abbreviating notation for the rank differences between two strategies.


name | tests null | e.s.(raw) | e.s.(norm) | stat.
sign test | positive and negative rank differences equally likely | proportion of positive rank differences | n/a | binomial count of positive differences
Friedman test | all strategies have equal expected rank (alternative: expected ranks differ for some strategy) | average rank differences | n/a | F-statistic

Table 7 describes common testing procedures, both of which may be seen as tests for a guarantee of expected rank difference in future data scenario (c). The sign test is a binomial test of the proportion of positive rank differences being significantly different from 1/2. In case of ties, a trinomial test is used. The implemented version of the Friedman test uses the F-statistic (and not the Q-statistic, a.k.a. chi-squared statistic) as described in [16].
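The Friedman statistics can be sketched from a table of per-dataset ranks. This is generic code following the F-statistic variant described in [16], not the MLaut interface; scipy.stats.friedmanchisquare offers the chi-squared version:

```python
def friedman_statistics(rank_table):
    """rank_table[j][i] is the rank of strategy i on dataset j (1..k).
    Returns the Friedman chi-squared statistic Q and the F-statistic
    variant F = (N-1)Q / (N(k-1) - Q). Note F is undefined in the
    degenerate case of identical rankings on all datasets."""
    n = len(rank_table)                  # number of datasets N
    k = len(rank_table[0])               # number of strategies k
    avg = [sum(row[i] for row in rank_table) / n for i in range(k)]
    q = 12 * n / (k * (k + 1)) * sum((r - (k + 1) / 2) ** 2 for r in avg)
    f = (n - 1) * q / (n * (k - 1) - q)
    return q, f

# three strategies ranked on four datasets (toy numbers)
q, f = friedman_statistics([[1, 2, 3], [2, 1, 3], [1, 3, 2], [1, 2, 3]])
```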
For post-hoc comparison and visualization of average rank differences, mlaut implements the combination of Bonferroni and studentized range multiple testing correction with Neményi's confidence intervals, as described in Section 3.3.
4 MLaut, API Design and Main Features
MLaut [32] is a modelling and workflow toolbox written with the aim of simplifying the task of running machine learning benchmarking experiments. MLaut was created with the specific use case of large-scale performance evaluation on a large number of real-life datasets, such as the study of [19]. Another key goal was to provide a scalable and unified high-level interface to the most important machine learning toolboxes, in particular to include deep learning models in such a large-scale comparison.
Below, we describe package design and functionality. A short usage handbook is included in Section 4.5.
MLaut may be obtained from pyPI via pip install mlaut, and is maintained on GitHub at github.com/alanturinginstitute/mlaut. A Docker container can also be obtained from Docker Hub via docker pull kazakovv/mlaut.
4.1 Applications and Use
MLaut's main use case is the setup and execution of supervised (classification and regression) benchmarking experiments. The package currently provides a high-level workflow interface to scikit-learn and keras models, but can easily be extended by the user to incorporate model interfaces from additional toolboxes into the benchmarking workflow.
MLaut automatically creates an end-to-end pipeline for processing data, training estimators, making predictions, and applying statistical quantification methodology to benchmark the performance of the different models.
More precisely, MLaut provides functionality to:

Automate the entire workflow for large-scale machine learning experiment studies. This includes structuring and transforming the data, selecting the appropriate estimators for the task and data at hand, tuning the estimators, and finally comparing the results.

Fit data and make predictions by using the prediction strategies as described in Section 5.4, or by implementing new prediction strategies.

Evaluate the results of the prediction strategies in a uniform and statistically sound manner.
4.2 Highlevel Design Principles
We adhered to the high-level API design principles adopted by the scikit-learn project [10]. These are:

Consistency.

Inspection.

Non-proliferation of classes.

Composition.

Sensible defaults.
We were also inspired by the Weka project [29], a platform widely used for its data mining functionalities. In particular, we wanted to replicate the ease of use of Weka in a pythonic setting.
4.3 Design Requirements
Specific requirements arise from the main use case of scalable benchmarking and the main design principles:

Extensibility. MLaut needs to provide a uniform and consistent interface to level 3 toolbox interfaces (as in Section 1.1). It needs to be easily extensible, e.g., by a user wanting to add a new custom strategy to benchmark.

Data collection management. Collections of data sets to benchmark on may be found on the internet or exist on a local computer. MLaut needs to provide abstract functionality for managing such data set collections.

Algorithm/model management. MLaut needs abstract functionality to match algorithms with data sets. This needs to include sensible default settings and easy metadata inspection of standard methodology.

Orchestration management. MLaut needs to conduct the benchmarking experiment in a standardized way with minimal user input beyond its specification, with sensible defaults for the experimental setup. The orchestration module needs to interact with, but be separate from the data and algorithm interface.

User Friendliness. The package needs to be written in a pythonic way and should not have a steep learning curve. Experiments need to be easy to set up, conduct, and summarize, from a python console or a jupyter notebook.
In our implementation of MLaut, we attempt to address the above requirements by creating a package which:

Has a nice and intuitive scripting interface. One of our main requirements was to have a native Python scripting interface that integrates well with the rest of our code. Our design attempts to reduce user interaction to the minimally necessary interface points: experiment specification, running of experiments, and querying of results.

Provides a high level of abstraction from underlying toolboxes. One of our main requirements was for MLaut to be completely model and toolbox agnostic. The scikit-learn interface alone was too lightweight for our purposes, as its parameter and metadata management is not interface-explicit (or inspectable).

Provides scalable workflow automation. This needed to be one of MLaut's cornerstone contributions. Its main logic is implemented in the orchestrator class, which orchestrates the evaluation of all estimators on all datasets. The class manages resources for building the estimator models and for saving/loading the data and the estimator models. It is also aware of the experiment's partial run state and can be used to easily resume an interrupted experiment.

Allows for easy estimator construction and retrieval. The end user of the package should be able to easily add new machine learning models to the suite of built-in ones in order to expand its functionality. Besides a small number of required methods to implement, we have provided interfaces to two of the most used level 3 toolbox packages, sklearn and keras.

Has a dedicated metadata interface for sensible defaults of estimators. We wanted to ensure that the estimators packaged in MLaut come with sensible defaults, i.e., predefined hyperparameters and tuning strategies that should be applicable in most use cases. The robustness of these defaults has been tested and proven as part of the original large-scale classification study. As such, the user is not required to have a detailed understanding of the algorithms and how they need to be set up in order to make full use of them.

Provides a framework for quantitative benchmark reporting. Easily accessible evaluation methodology for the benchmarking experiments is one of the key features of the package. We also considered reproducibility of results as vital, reflected in a standardized setup and interface for the experiments, as well as control throughout of pseudo-random seeds.

Orchestrates the experiments and parallelizes the load over all available CPU cores. A large benchmarking study can be quite computationally expensive. Therefore, we needed to make sure that all available machine resources are fully utilized in the process of training the estimators. In order to achieve this, we used the parallelization methods available as part of the GridSearch method and natively with some of the estimators. Furthermore, we also provide a Docker container for running MLaut, which we recommend using as a default, as it allows the package to run in the background at full load.

Provides a uniform way of storing and retrieving data. Results of benchmarking experiments needed to be saved in a uniform way and made available to users and reviewers of the code. At the current stage, we implemented backend functionality for management via local HDF5 database files. In the future, we hope to support further data storage backends behind the same orchestrator-sided façade interface.
4.3.1 Estimator encapsulation
MLaut implements a logic of encapsulating metadata with the estimator it pertains to. This is achieved by using a decorator class that is attached to each estimator class. By doing this, our extended interface is able to bundle wide-ranging metadata with each estimator class. This includes:

Basic estimator properties such as name, estimator family;

Types of tasks that a particular estimator can be applied to;

The type of data which the estimator expects or can handle;

The model architecture (on level 3, as in Section 1.1). This is particularly useful for more complex estimators such as deep neural networks. By applying the decorator structure the model architecture can be easily altered without changing the underlying estimator class.
This extended design choice has significant benefits for a benchmarking workflow package. First, it allows searching for estimators based on basic criteria such as task or estimator family. Second, it allows default hyperparameter settings used by the estimators to be inspected, queried, and changed. Third, strategies with different internal model architectures can be deployed with relative ease.
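The encapsulation idea can be illustrated with a generic Python sketch. The class, function, and field names below are illustrative, not MLaut's actual interface:

```python
def estimator_metadata(name, family, tasks, data_types):
    """Class decorator bundling benchmarking metadata with an estimator
    class, so that estimators can be inspected and filtered later."""
    def wrap(cls):
        cls.metadata = {
            "name": name,
            "family": family,
            "tasks": tasks,            # e.g. {"classification"}
            "data_types": data_types,  # e.g. {"tabular"}
        }
        return cls
    return wrap

@estimator_metadata(name="RandomForest", family="ensembles_of_trees",
                    tasks={"classification"}, data_types={"tabular"})
class MyForestStrategy:
    pass

# metadata-based retrieval: find all strategies applicable to a given task
strategies = [MyForestStrategy]
classifiers = [s for s in strategies
               if "classification" in s.metadata["tasks"]]
```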
4.3.2 Workflow design
The workflow supported by MLaut consists of the following main steps:

Data collection. As a starting point, the user needs to gather and organize the datasets of interest on which the experiments will be run. The raw datasets need to be saved in an HDF5 database. Metadata needs to be attached to each dataset; it is later used in the training phase, for example to distinguish the target variables. MLaut provides an interface for manipulating the databases through its Data and Files_IO classes. The logic of the toolbox is to provision two HDF5 databases: one for storing the input data, i.e., the datasets, and a second for storing the output of the machine learning experiments and processed data such as train/test index splits. This separation of input and output is not required, but is recommended. The datasets also need to be split into training and test sets before proceeding with the next phase in the pipeline. The indices of the train and test splits are stored separately from the actual datasets in the HDF5 database to ensure data integrity and reproducibility of the experiments. All estimators are trained and tuned on the training set only. At the end of this process, the estimators are applied to the test sets, which guarantees that all predictions are made on unseen data.

Training phase. After the datasets are stored in the HDF5 database following the convention adopted by MLaut, the user can proceed to training the estimators. The user needs to provide an array of machine learning estimators to be used in the training process. MLaut provides a number of default estimators that can be instantiated through the estimators module. The package also provides the flexibility for users to write their own estimators by inheriting from the mlaut_estimator class. Furthermore, there is a generic_estimator module which lets the user create new estimators with only a couple of lines of code.
The task of training the experiments is performed by the experiments.Orchestrator class. This class manages the sequencing of the training and the parallelization of the load. Before training, each dataset is preprocessed according to metadata provided on the estimator level. This includes normalizing the features and target variables, and converting categorical values to numerical ones.
We recommend running the experiments inside a Docker container if they are very computationally intensive. This allows MLaut to run in the background on a server without shutting down unexpectedly due to loss of connection. We have provided a Docker image that makes this process easy.

Making predictions. During training, the fitted models are stored on the hard drive. At the end of the training phase, the user can again use the experiments.Orchestrator class to retrieve the trained models and make predictions on the test sets.

Analyse results. The last stage is analysing the output of the machine learning experiments. In order to initiate the process, the user needs to call the analyze_results.prediction_errors method, which returns two dictionaries with the average errors per estimator over all datasets, as well as the errors per estimator achieved on each dataset. These results can be used as inputs to the statistical tests that are also provided as part of the analyze_results module, which mostly follow the methodology proposed by [16].
4.4 Software Interface and Main Toolbox Modules
MLaut is built around the logic of the pipeline workflow described earlier. Our aim was to implement the programming logic for each step of the pipeline in a different module. Code that is logically used in more than one of the stages is implemented in a Shared module that is accessible by all other classes. The current design pattern is most closely represented by the façade and adaptor patterns, under which the user interacts with one common interface to access the underlying adaptors, which represent the underlying machine learning and statistical toolboxes.
4.4.1 Data Module
The Data module contains the high-level methods for manipulating the raw datasets. It provides a second layer of interface on top of the lower-level classes for accessing, storing, and extracting data from HDF5 databases. This module makes heavy use of the functionality developed in the Shared module but provides a higher level of abstraction for the user.
4.4.2 Estimators Module
This module encompasses all machine learning models that come with MLaut, as well as methods for instantiating them based on criteria provided by the user. We created MLaut for the purpose of running supervised classification experiments, but the toolbox also comes with estimators that can be used for supervised regression tasks.
From a software design perspective, the most notable method in this class is the build method, which returns an instantiated estimator with the appropriate hyperparameter search space and model architecture. In software design terms, this approach most closely resembles the builder design pattern, which aims at separating the construction of an object from its representation. This design choice allows the base mlaut_estimator class to create different representations of machine learning models.
The mlaut_estimator object includes methods that complete its set of functionalities. Among the main ones is a save method that takes into account the most appropriate format to persist a trained estimator object; this could be the pickle format used by most scikit-learn estimators or the HDF5 format used by keras. A load function is also available for restoring saved estimators.
The design of the package also relies on the estimators having uniform fit and predict methods that take the same input data and generate predictions in the same format. These methods are not implemented at the mlaut_estimator level; instead, we rely on the fact that these fundamental methods are uniform across the underlying packages. However, there is a discrepancy in the behaviour of the scikit-learn and keras estimators: for classification tasks, keras requires the labels of the training data to be one-hot encoded, and the default behaviour of the keras predict method is equivalent to predict_proba in scikit-learn. We solved these discrepancies by overriding the fit and predict methods of the implemented keras estimators.
Through the use of decorators and by implementing the build method, we are able to fully customize the estimator object with minimal required programming. The decorator class allows the metadata associated with the estimator to be set; this includes the name, estimator family, types of tasks, and hyperparameters. Together with an implemented build method, this gives the user a fully specified machine learning model. This approach also facilitates the application of the algorithms and the use of the software, as we can ensure that each algorithm is matched to the correct datasets. Furthermore, it allows the required algorithms to be easily retrieved by executing a simple command.
Closely following the terminology and taxonomy of [30], mlaut estimators are currently assigned to one of the following methodological families:


Baseline Estimators. This family of models is also referred to as dummy estimators and serves as a benchmark against which to compare other models. It does not aim to learn any representation of the data but simply adopts a strategy of guessing.

Generalized Linear Model Estimators. A family of models that assumes that a (generalized) linear relationship exists between the features and the target values.

Naive Bayes Estimators. This class of models applies Bayes' theorem by making the naive assumption that all features are independent.

Prototype Method Estimators. A family of models that apply prototype matching techniques for fitting the data. The most prominent member of this family is the K-means algorithm.

Kernel Method Estimators. A family of models using kernelization techniques, including support vector machine based estimators.

Deep Learning and Neural Network Estimators. This family of models provides implementation of neural network models, including deep neural networks.

Ensembles-of-Trees Estimators. A family of methods that combines the predictions of several tree-based estimators in order to produce a more robust overall estimator. This family is further divided into:

averaging methods. The models in this group average the predictions of several independent models in order to arrive at a combined estimator. An example is Breiman’s random forest.

boosting methods. An ensembling approach of building models sequentially, based on iterative weighted residual fitting. An example is stochastic gradient boosted tree models.

In addition to this, the user also has the option to write their own estimator objects. To achieve this, the new class needs to inherit from the mlaut_estimator class and implement its abstract methods. The main abstract method that needs to be implemented is the build method, which returns a wrapped instance of the estimator together with the set of hyperparameters that will be used in the tuning process. For further details about the implemented estimators refer to Section 5.4.
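The extension pattern just described can be sketched as follows; BaseMlautEstimator is a simplified stand-in for the mlaut_estimator base class, and the grid contents are illustrative only.

```python
from abc import ABC, abstractmethod

# Simplified stand-in for the mlaut_estimator abstract base class.
class BaseMlautEstimator(ABC):
    @abstractmethod
    def build(self, **kwargs):
        """Return a wrapped estimator together with its hyperparameter grid."""

# A user-defined estimator: inherit from the base class and implement build().
class MyCustomEstimator(BaseMlautEstimator):
    def build(self, **kwargs):
        hyperparameters = {"n_neighbors": list(range(1, 31))}  # tuning grid
        return {"estimator": "knn", "hyperparameters": hyperparameters}
```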
4.4.3 Experiments Module
This module contains the logic for orchestrating the machine learning experiments. The main parameters in this module are the datasets and the estimator models that will be trained on the data. The main run method of the module then proceeds to train all estimators on all datasets, sequentially. The core of the method consists of two nested for loops, the first of which iterates over the datasets and the second over the estimators. Inside the inner loop, the orchestrator class builds an estimator instance for each dataset. This allows the machine learning model to be tailored to each dataset; for example, the architecture of a deep neural network can be altered to include the appropriate number of neurons based on the input dimensions of the data. This module is also responsible for saving the trained estimators and making predictions. It should be noted that the orchestrator module is not responsible for the parallelization of the experiments, which is handled at the level of the individual estimators.
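The nested-loop structure of the run method can be sketched schematically (names are hypothetical; the real orchestrator also persists models and predictions to the backend):

```python
# Schematic of the orchestration logic: the outer loop iterates over datasets,
# the inner loop over estimators, and each estimator is (re)built per dataset
# so that, e.g., a network's input layer can match the data's dimensionality.
def run(datasets, estimator_builders):
    results = []
    for dataset in datasets:                  # outer loop: datasets
        for build in estimator_builders:      # inner loop: estimators
            model = build(input_dim=dataset["n_features"])  # tailored per dataset
            results.append((dataset["name"], model))
    return results
```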
4.4.4 Result Analysis Module
This module includes the logic for performing the quantitative evaluation and comparison of the machine learning strategies’ performance. The predictions of the trained estimators on the test sets for each dataset serve as input. First, performances and, if applicable, standard errors on the individual data sets are computed, for a given average or aggregate loss/performance quantifier. The samples of performances are then used as inputs for comparative quantification.
API-wise, the framework for assessing the performance of the machine learning estimators hinges on three main classes. The analyze_results class implements the calculation of the quantifiers. Through composition, this class relies on the losses class, which performs the actual calculation of the prediction performances over the individual test sets. The third main class that completes the framework design is the scores class. It defines the loss/quantifier function that is used for assessing the predictive power of the estimators. An instance of the scores class is passed as an argument to the losses class.
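A minimal sketch of this composition (the classes below are simplified stand-ins, not the actual MLaut implementation): a scores object is injected into a losses object, which the analysis class then uses to evaluate all estimators.

```python
# Simplified stand-ins illustrating the scores -> losses -> analysis composition.
class AccuracyScore:
    def __call__(self, y_true, y_pred):
        # fraction of correctly predicted labels
        return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

class Losses:
    def __init__(self, score):
        self.score = score  # scores instance passed in by composition
    def evaluate(self, predictions, y_true):
        # mini orchestration: compute the score for every supplied estimator
        return {name: self.score(y_true, y_pred)
                for name, y_pred in predictions.items()}

class AnalyseResults:
    def __init__(self, losses):
        self.losses = losses
    def prediction_errors(self, predictions, y_true):
        return self.losses.evaluate(predictions, y_true)
```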
We believe that this design choice of using three classes provides the necessary flexibility for the composite performance quantifiers described in Section 3, i.e., it allows ranks to be computed for an arbitrarily chosen loss (e.g., mean rank with respect to mean absolute error), or comparison testing to be performed using an arbitrarily chosen performance quantifier (e.g., Wilcoxon signed rank test comparing F1 scores).
Our API also facilitates custom user extension, e.g., for users who wish to add a new score function, an efficient way to compute aggregate scores or standard errors, or a new comparison testing methodology. For example, new score functions can easily be added by inheriting from the MLautScore abstract base class. The losses class, in turn, completely encapsulates the logic for calculating the predictive performance of the estimators. This is particularly useful as the class internally implements a mini orchestration procedure for calculating and presenting the loss achieved by all estimators supplied as inputs. Lastly, the suite of statistical tests available in MLaut can easily be expanded by adding the appropriate method to the analyze_results class or a descendant.
Mathematical details of the quantification procedures implemented in MLaut were presented in Section 3.
Usage details
In this implementation of MLaut we use third-party packages for performing the statistical tests. We rely mostly on the scikit-learn package; however, for post-hoc tests we use the scikit-posthocs package [43] and the Orange package [17], which we also used for creating critical difference diagrams for comparing multiple classifiers.
4.4.5 Shared Module
This module includes classes and methods that are shared by the other modules in the package. The Files_IO class comprises all methods for manipulating files and datasets, including saving/loading trained estimators from the HDD and manipulating the HDF5 databases. The Shared module also keeps all static variables that are used throughout the package.
4.5 Workflow Example
We give a stepbystep overview over the most basic variant of the user workflow. Advanced examples with custom estimators and setups may be found in the MLaut tutorial [32].
Step 0: setting up the data set collection
The user should begin by setting up the data set collection via the Files_IO class. Metadata needs to be provided for each dataset, including as a minimum the class column (target attribute) name and the name of the dataset. This needs to be done once for every dataset collection, and may not be necessary for a pre-existing or pre-deployed collection. Currently, only local HDF5 databases are supported.
We have implemented backend setup routines which download specific data set collections and generate the metadata automatically. Current support includes the UCI library data sets and OpenML. Alternatively, the backend may be populated directly by storing an inmemory pandas DataFrame via the save_pandas_dataset method, e.g., as part of custom loading scripts.
In this case, metadata for the individual datasets needs to be provided in the following dictionary format:
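As a hedged illustration (the key names below are assumptions; the required minimum, per the above, is the class column name and the dataset name), such a dictionary might look like:

```python
# Hypothetical metadata dictionary; key names are illustrative assumptions,
# not necessarily MLaut's actual schema.
metadata = {
    "class_name": "target",   # name of the label/class column in the dataset
    "dataset_name": "iris",   # name under which the dataset is stored
}
```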
Step 1: initializing data and output locations
As the next step, the user should specify the backend links to the data set collections (“input”) and to intermediate or analysis results (“output”). This is done via the data class. It is helpful for code readability to store these in input_io and out_io variables.
These may then be supplied as parameters to preparation and orchestration routines. We then proceed to obtain the paths to the raw datasets as well as the respective train/test splits, through the use of the list_datasets and split_datasets methods, respectively.
Step 2: initializing estimators
The next step is to instantiate the learning strategies (estimators, in sklearn terminology) that we want to use in the benchmarking exercise. The most basic and fully automated variant is the use of the instantiate_default_estimators method, which loads a predefined set of defaults given specified criteria. Currently, only a simple string lookup via the estimators parameter is implemented, but we plan to extend the search/matching functionality. The string criterion may be used to fetch specific estimators by a list of names, entire families of models, estimators by task (e.g., classification), or simply all available estimators.
Step 3: orchestrating the experiment
The final step is to run the experiment by passing references to the data and estimators to the orchestrator class, then initiating the training process by invoking its run method.
Step 4: computing benchmark quantifiers
After the estimators are trained and their predictions are recorded, we can proceed to obtaining quantitative benchmark results for the experiments.
For this, we need to instantiate the AnalyseResults class by supplying the folders where the raw datasets and predictions are stored. Its prediction_errors method may be invoked to return both the calculated prediction performance quantifiers per estimator, as well as the prediction performances per estimator and per dataset.
The prediction errors per dataset and per estimator can be directly examined by the user. The estimator performances, on the other hand, may be used as further inputs for comparative quantification via hypothesis tests. For example, we can perform a paired t-test for pairwise comparison of methods by invoking the code below:
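The underlying computation can be illustrated with a self-contained paired t-statistic on two estimators' per-dataset losses (MLaut wraps this inside its analysis classes; the function below is an illustrative stand-in, not MLaut's actual interface):

```python
import math

def paired_t_statistic(losses_a, losses_b):
    """Paired t-statistic on two equal-length samples of per-dataset losses."""
    diffs = [a - b for a, b in zip(losses_a, losses_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # unbiased variance
    return mean / math.sqrt(var / n)
```

The statistic is antisymmetric in its arguments and is compared against a t distribution with n - 1 degrees of freedom, with Bonferroni correction applied across the family of pairwise tests as described in Section 3.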
5 Using MLaut to Compare the Performance of Classification Algorithms
As a major test use case for MLaut, we conducted a large-scale benchmark experiment comparing a selection of off-the-shelf classifiers on datasets from the UCI Machine Learning Repository. Our study had the following main aims:

stress testing the MLaut framework at scale, and observing the user interaction workflow in a major test case.

replicating the key points of the experimental setup of [19], while avoiding their severe mistake of tuning on the test set.

including deep learning methodology in the experiment.
Given the above, the benchmarking study below is, to the best of our knowledge, the first large-scale supervised classification study which (in the disjunctive sense: i.e., to the best of our knowledge, the first large-scale benchmarking study which does any of the following, rather than only the first study to do all of them):

is correctly conducted via out-of-sample evaluation and comparison. This is since [19] commit the mistake of tuning on the test set, as is even acknowledged in their own Section 3, Results and Discussion.

includes contemporary deep neural network classification approaches, and is conducted on a broad selection of classification data sets (the UCI dataset collection) that is not specific to a special domain such as image classification.
We intend to extend the experiment in the future by including further dataset collections and learning strategies.
Full code for our experiments, including random seeds, can be found as a Jupyter notebook in MLaut's documentation [32].
5.1 Hardware and software setup
The benchmark experiment was conducted on a Microsoft Azure VM with 16 CPU cores and 32 GB of RAM, using our Docker virtualized implementation of MLaut. The experiments ran for about 8 days. MLaut requires Python 3.6 and should be installed in a dedicated virtual environment in order to avoid conflicts; alternatively, the Docker implementation should be used. The full code for running the experiments and for generating the results in Appendix A can be found in the examples directory of the project's GitHub repository.
5.2 Experimental setup
5.2.1 Data set collection
The benchmarking study uses the same dataset collection as employed by [19]. This collection consists of 121 tabular datasets for supervised classification, taken directly from the UCI machine learning repository. Prior to the experiment, each dataset was standardized, such that each individual feature variable has a mean of 0 and a standard deviation of 1.
The dataset collection of [19] is intended to be representative of a wide scope of basic real-world classification problems. It should be noted that this representative cross-section of simple classification tasks excludes more specialized tasks such as image, audio, or text/document classification, which are usually regarded as typical applications of deep learning, and for which deep learning is also the contemporary state of the art. For a detailed description of the Fernández-Delgado et al. [19] data collection, see Section 2.1 there.
5.2.2 Resampling for evaluation
Each dataset is split into exactly one pair of training and test set. The training sets are selected, for each data set, uniformly at random (independently for each dataset in the collection) as a (rounded) fraction of the available data sample; the remaining observations in the dataset form the test set on which the strategies are asked to make predictions. Random seeds and the indices of the exact splits were saved to ensure reproducibility and post-hoc scrutiny of the experiments.
The training set may (or may not) be further split by the contender methods for tuning. As stated previously in Section 2, this is not enforced as part of the experimental setup (unlike in the setup of [19], which, on top of doing so, is also faulty), but is left to each learning strategy to deal with internally, and will be discussed in the next section. In particular, none of the strategies have access to the test set for tuning or training.
5.3 Evaluation and comparison
We largely followed the procedure suggested by [16] for the analysis of the performance of the trained estimators. For all classification strategies, the following performance quantifiers are computed per dataset:

misclassification loss

rank of misclassification loss

runtime
Averages of these are computed, with standard errors for future data situation (c: retrained, on unseen dataset). In addition, for the misclassification loss on each data set, standard errors for future data situation (a: reused, same dataset) are computed.
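For the misclassification loss, the situation (a) standard error reduces to the standard error of the mean of the 0/1 losses over the test set; a minimal sketch, assuming this is the computation meant (see Section 3 for the formal definitions):

```python
import math

def misclassification_loss_with_se(y_true, y_pred):
    """Mean 0/1 loss on one test set, with the standard error of the mean."""
    losses = [0.0 if t == p else 1.0 for t, p in zip(y_true, y_pred)]
    n = len(losses)
    mean = sum(losses) / n
    var = sum((l - mean) ** 2 for l in losses) / (n - 1)  # sample variance
    return mean, math.sqrt(var / n)
```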
The following pairwise comparisons between samples of performances by dataset are computed:

paired t-test on misclassification losses, with Bonferroni correction

(paired) Wilcoxon signed rank test on misclassification losses, with Bonferroni correction

Friedman test on ranks, with Neményi’s significant rank differences and posthoc significances
Detailed descriptions of these may be found in Section 3.
5.4 Benchmarked machine learning strategies
Our choice of classification strategies is not exhaustive, but is meant to be representative of off-the-shelf choices in the scikit-learn and keras packages. We intend to extend the selection in future iterations of this study.
From scikit-learn, the suite of standard off-the-shelf approaches includes linear models, naive Bayes, SVM, ensemble methods, and prototype methods.
We used keras to construct a number of neural network architectures representative of the state of the art. This proved a challenging task due to the lack of explicitly recommended architectures for simple supervised classification in the literature.
5.4.1 Tuning of estimators
It is important to note that the off-the-shelf choices and their default parameter settings are often not considered good or state of the art: hyperparameters in scikit-learn are not tuned by default, and no default keras architectures come with the package.
For scikit-learn classifiers, we tune parameters using scikit-learn's GridSearchCV wrapper-compositor (which by construction never looks at the test set).
In all cases of tuned methods, parameter selection in the inner tuning loop is done via grid tuning with 5-fold cross-validation, with respect to the default score function implemented at the estimator level. For classifiers, as in our study, the default tuning score is mean accuracy (averaged over all 5 tuning test folds in the inner cross-validation tuning loop), which is equivalent to tuning by mean misclassification loss.
The tuning grids will be specified in Section 5.4.2 below.
For keras classifiers, we built architectures by interpolating general best practice recommendations in the scientific literature [34], as well as from concrete designs found in software documentation or unpublished case studies circulating on the web. We further followed the sensible default choices of keras whenever possible. The specific choices of neural network architectures and hyperparameters are specified in Section 5.4.3 below.
5.4.2 Offshelf scikitlearn supervised strategies


Algorithms that do not have any tunable hyperparameters
Estimator name  sklearn.dummy.DummyClassifier
Description  This classifier is a naive/uninformed baseline and always predicts the most frequent class in the training set (“majority class”). This corresponds to the choice of the most_frequent parameter.
Hyperparameters  None
Estimator name  sklearn.naive_bayes.BernoulliNB
Description  Naive Bayes classifier for multivariate Bernoulli models. This classifier assumes that all features are binary; if they are not, they are converted to binary. For reference see [7], Chapter 2.
Hyperparameters  None
Estimator name  sklearn.naive_bayes.GaussianNB
Description  Standard implementation of the Naive Bayes algorithm under the assumption that the features are Gaussian. For reference see [7], Chapter 2.
Hyperparameters  None
Linear models
Estimator name  sklearn.linear_model.PassiveAggressiveClassifier
Description  Part of the online learning family of models, based on the hinge loss function. This algorithm observes feature-value pairs in a sequential manner. After each observation, the algorithm makes a prediction, checks the correct value, and calibrates the weights. For further reference see [15].
Hyperparameters  C: array of 13 equally spaced numbers on a log scale (scikit-learn default: 1)
Clustering Algorithms
Estimator name  sklearn.neighbors.KNeighborsClassifier
Description  The algorithm uses a majority vote of the nearest neighbours of each data point to make a classification decision. For reference see [14] and [7], Chapter 2.
Hyperparameters  n_neighbors = [1; 30] (scikit-learn default: 5); p = [1, 2] (scikit-learn default: 2)
Kernel Methods
Estimator name  sklearn.svm.SVC
Description  This estimator is part of the support vector family of algorithms. In this study, we use the Gaussian kernel only. For reference see [13] and [7], Chapter 7. The performance of support vector machines is very sensitive with respect to the tuning parameters:
C, the regularization parameter. There does not seem to be a consensus in the community regarding the search space for the C hyperparameter. One example in the scikit-learn documentation [39] (at the time of writing available at http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html) refers to an initial hyperparameter search space for C. However, a different example (at the time of writing available at http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html) suggests another. A third scikit-learn example suggests testing both the linear and rbf kernels with broad values for the C and gamma parameters. Other researchers [27] suggest applying a search for C in a range which we used in our study, as it provides a good compromise between reasonable running time and comprehensiveness of the search space.

gamma, the inverse kernel bandwidth. A scikit-learn example [39] (at the time of writing available at http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html) suggests one hyperparameter search space for gamma. However, a second scikit-learn example (at the time of writing available at http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html) suggests searching only in a narrower range. On the other hand, [27] suggest searching for gamma in a range which we again found to be the middle ground and applied in our study.
Hyperparameters  C: array of 13 equally spaced numbers on a log scale (scikit-learn default: 1); gamma: array of 13 equally spaced numbers on a log scale (scikit-learn default: auto)
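The "13 equally spaced numbers on a log scale" grids above can be generated as below; the endpoint exponents here are illustrative placeholders, since the concrete ranges are only referenced, not reproduced, in the text.

```python
# Generate `num` points equally spaced on a log10 scale between 10**lo_exp
# and 10**hi_exp (the exponents used below are illustrative placeholders).
def log_grid(lo_exp, hi_exp, num=13):
    step = (hi_exp - lo_exp) / (num - 1)
    return [10 ** (lo_exp + i * step) for i in range(num)]

param_grid = {"C": log_grid(-2, 4), "gamma": log_grid(-4, 2)}
```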

Ensemble Methods
The three main models from this family that we used in this study are the random forest, bagging, and boosting models. All three are built around the logic of combining the predictions of a large number of weak estimators, such as decision trees, and as such they share many of the same hyperparameters. The main parameters for this family of models are the number of estimators, the maximum number of features, and the maximum tree depth; default values for each estimator are suggested in the scikit-learn package. Recent research [36] and informal consensus in the community suggest that deviating from the default parameters yields performance gains for boosting algorithms but tends to bring limited improvements for the random forest algorithm. As such, for the purposes of this study we focus our tuning efforts on the boosting and bagging algorithms, and use a relatively small parameter search space for tuning the random forest.
Estimator name  sklearn.ensemble.GradientBoostingClassifier
Description  Part of the ensemble meta-estimators family of models. We used the default sklearn deviance loss. The algorithm fits a series of decision trees on the data, and predictions are made based on a majority vote. At each iteration the data is modified by applying weights to it and predictions are made again: the weights of the incorrectly predicted pairs are increased (boosted), and those of the correctly predicted pairs are decreased. As per the scikit-learn documentation [39], this estimator is not recommended for datasets with more than two classes, as it requires the introduction of regression trees at each iteration; the suggested approach is to use the random forest algorithm instead. Many of the datasets used in this study are multi-class supervised learning problems; however, for the purposes of this study we use the gradient boosting algorithm in order to see how it performs when benchmarked against the suggested approach. For reference see [21] and [18], Chapter 17.
Hyperparameters  number of estimators (scikit-learn default: 100); max depth: integers in a range (scikit-learn default: 3)
Estimator name  sklearn.ensemble.RandomForestClassifier
Description  Part of the ensemble meta-estimators family of models. The algorithm fits decision trees on subsamples of the dataset. The average voting rule is used for making predictions. For reference see [9] and [18], Chapter 17.
Hyperparameters  For this study we used the following hyperparameter grid: number of estimators (scikit-learn default: 10); max features: [auto, sqrt, log2, None] (scikit-learn default: auto); max depth (scikit-learn default: None)
Estimator name  sklearn.ensemble.BaggingClassifier
Description  Part of the ensemble meta-estimators family of models. The algorithm draws feature-label pairs with replacement, trains decision tree base estimators, and makes predictions based on voting or averaging rules. For reference see [8] and [18], Chapter 17.
Hyperparameters  number of estimators (scikit-learn default: 10)
5.4.3 Keras neural network architectures including deep neural networks
We briefly summarize our choices for hyperparameters and architecture.

Architecture. Efforts have been made to make the choice of architectures less arbitrary by suggesting algorithms for finding the optimal neural network architecture [23]. Other researchers have suggested good starting points and best practices that one should adhere to when devising a network architecture [25, 38]. We also followed the guidelines of [22], in particular Chapter 6.4. The authors conducted an empirical study and showed that the accuracy of networks increases as the number of layers grows, but that the gains diminish rapidly beyond 5 layers. These findings are also confirmed by other studies [3] that question the need to use very deep feedforward networks. In general, the consensus in the community seems to be that 2-4 hidden layers are sufficient for most feedforward network architectures. One notable exception to this rule seems to be convolutional network architectures, which have been shown to perform best when several sequential layers are stacked one after the other. However, this study does not make use of convolutional neural networks, as our data is not suitable for these models, in particular because there is no well-specified way to transform samples into a multidimensional array form. The architectures are given below as their keras code specification.

Activation function. We used the rectified linear unit (ReLU) as our default choice of activation function, as it has been found to accelerate convergence and is relatively inexpensive to compute [33].
Regularization. We employ the current state of the art in neural network regularization: dropout. In the absence of clear rules about when and where dropout should be applied, we include two versions of each neural network in the study: one without dropout, and one with dropout. Dropout regularization is as described by Hinton et al. [26] and Srivastava et al. [42], where its potential for improving the generalization accuracy of neural networks is shown. We used a dropout rate of 0.5 as suggested by the authors.

Hyperparameter tuning. We did not perform a grid search to find the optimal hyperparameters for the networks, for two reasons. First, we interfaced the neural network models from keras, whose interface is not fully compatible with scikit-learn's GridSearch, nor does it provide easy off-the-shelf tuning facilities (see Section 4.4.2 for details). Second, grid search tuning does not seem to be considered common practice by the community, and some researchers even actively recommend avoiding it [5]; hence it might not be considered a fair representation of the state of the art. Instead, the prevalent practice seems to be manual tuning of hyperparameters based on learning curves. Following the latter, in the absence of off-the-shelf automation, we manually tuned the learning rate, batch size, and number of epochs by inspection of learning curves and performances on the full training sets (see below).
Learning Rate. The learning rate is one of the crucial hyperparameter choices when training neural networks. The generally accepted rule for finding a good rate is to start with a large rate and, if the training process diverges, decrease the learning rate by a factor of 3 until it no longer does [4]. This approach is confirmed by [42], who also affirm that a larger learning rate can be used in conjunction with dropout without risking that the weights of the model blow up.

Batch Size. The datasets used in the study were relatively small and could fit in the memory of the machine that we used for training. As a result, we set the batch size to the full size of each dataset, which is equivalent to full (batch) gradient descent.

Number of epochs. We performed manual hyperparameter selection by inspecting individual learning curves for all combinations of learning rate and architecture. For this, learning curves on individual data sets' training samples were inspected visually for the “plateau range” (the range of minimal training error). For all architectures, and most data sets, the plateau was already reached after one single epoch, and training error usually tended to increase in the range of 50-500 epochs. The remaining, small number of datasets (most of which were of four-or-more-digit sample size) plateaued in the single-digit range.
While this is a very surprising finding, as it corresponds to a single gradient descent step, it is what we found when following what we consider the standard manual tuning protocol for neural networks. We further discuss this in Section 5.6 and acknowledge that this surprising finding warrants further investigation, e.g., through checking for mistakes, or including neural networks tuned by automated schemes.
Thus, all neural network architectures were trained for one single epoch, since choosing a larger (and more intuitive) number of epochs would have been somewhat arbitrary, and not in concordance with the common manual tuning protocol.
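The manual learning-rate rule described above (start large, decrease by a factor of 3 on divergence) can be sketched as a simple search loop; the diverges predicate is a hypothetical placeholder standing in for launching a training run and inspecting its learning curve.

```python
def find_learning_rate(initial_lr, diverges, factor=3.0, max_tries=10):
    """Decrease the learning rate by `factor` until training stops diverging.

    `diverges` is a hypothetical callable standing in for running a training
    attempt and checking whether the loss blows up.
    """
    lr = initial_lr
    for _ in range(max_tries):
        if not diverges(lr):
            return lr
        lr /= factor
    return lr
```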
For the keras models, we adopted six neural network architectures with varying depths and widths. Our literature review revealed that there is no consistent body of knowledge or concrete rules pertaining to constructing neural network models for simple supervised classification (as opposed to image recognition, etc.). Therefore, we extrapolated from general best practice guidelines as applicable to our study, and also included (shallow) network architectures that were previously used in benchmark studies. The full keras architectures of the neural networks used are listed below.
Estimator name  keras.models.Sequential 
Description  Own architecture of a deep neural network model applying the principles highlighted above. For this experiment we made use of the empirical evidence, discussed in [3], that networks of 3-4 layers are sufficient to learn any function. However, we opted for a slightly narrower network in order to investigate whether wider nets tend to perform better than narrow ones. 
Hyperparameters  batch size: None, learning rate: , loss: mean squared error, optimizer: Adam, metrics: accuracy. 
Estimator name  keras.models.Sequential 
Description  In this architecture we experimented with the idea that wider networks perform better than narrower ones. No dropout was applied, in order to test the idea that regularization is necessary for all deep neural network models. 
Hyperparameters  batch size: None, learning rate: , loss: mean squared error, optimizer: Adam, metrics: accuracy. 
Estimator name  keras.models.Sequential 
Description  We tested the same architecture as above, but applying dropout after the first two layers. 
Hyperparameters  batch size: None, learning rate: , loss: mean squared error, optimizer: Adam, metrics: accuracy. 
Estimator name  keras.models.Sequential 
Description  Deep neural network model inspired by the architecture suggested by [38]: 
Hyperparameters  batch size: None, learning rate: , loss: mean squared error, optimizer: Adam, metrics: accuracy. 
Estimator name  keras.models.Sequential 
Description  Deep Neural Network model suggested in [26] with the following architecture: 
Hyperparameters  batch size: None, learning rate: , loss: mean squared error, optimizer: Adam, metrics: accuracy. 
Estimator name  keras.models.Sequential 
Description  Deep Neural Network model suggested in [42] with the following architecture: 
Hyperparameters  batch size: None, learning rate: , loss: mean squared error, optimizer: Adam, metrics: accuracy. 
5.5 Results
Table 8 shows a summary overview of the results.
Estimator  avg_rank  avg_score  std_error  avg training time (in sec)  
RandomForestClassifier  4.3  0.831  0.013  14.277 
SVC  5.0  0.818  0.014  1742.466 
K_Neighbours  5.6  0.805  0.014  107.796 
BaggingClassifier  5.8  0.820  0.014  5.231 
GradientBoostingClassifier  7.6  0.790  0.016  49.509 
PassiveAggressiveClassifier  8.5  0.758  0.016  19.352 
NN4layer_wide_with_dropout_lr001  10.0  0.692  0.021  14.617 
NN4layer_wide_no_dropout_lr001  10.5  0.694  0.021  14.609 
BernoulliNaiveBayes  10.9  0.707  0.015  0.005 
NN4layerdroputeachlayer_lr0001  11.2  0.662  0.022  6.786 
NN4layer_thin_dropout_lr001  11.6  0.652  0.022  2.869 
NN2layerdroputinputlayer_lr001  11.7  0.655  0.021  5.420 
GaussianNaiveBayes  13.4  0.674  0.019  0.004 
NN12layer_wide_with_dropout_lr001  16.3  0.535  0.023  40.003 
NN2layerdroputinputlayer_lr01  17.3  0.543  0.023  5.413 
NN2layerdroputinputlayer_lr1  17.9  0.509  0.023  5.437 
NN4layer_thin_dropout_lr01  18.0  0.494  0.024  5.559 
NN4layer_wide_no_dropout_lr01  18.4  0.494  0.022  10.530 
NN4layerdroputeachlayer_lr1  18.5  0.488  0.022  6.901 
NN4layer_wide_with_dropout_lr1  18.5  0.483  0.022  10.738 
NN4layer_wide_no_dropout_lr1  18.6  0.490  0.022  10.561 
NN4layer_wide_with_dropout_lr01  18.7  0.478  0.022  10.696 
NN4layerdroputeachlayer_lr01  18.8  0.482  0.022  6.818 
NN12layer_wide_with_dropout_lr01  18.8  0.479  0.022  70.574 
NN12layer_wide_with_dropout_lr1  19.0  0.458  0.023  68.505 
NN4layer_thin_dropout_lr1  19.4  0.462  0.023  4.299 
BaselineClassifier  23.7  0.419  0.019  0.001 
Figure 5 summarizes the samples of performances in terms of classification accuracy. Each sample is the performance of one method, ranging over data sets, averaged over the test sample within each dataset; i.e., the size of each sample of performances equals the number of data sets in the collection.
The Friedman test was significant at level p = 2e-16. Figure 6 displays effect sizes, i.e., average ranks with Neményi's post-hoc critical differences.
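The critical difference underlying Figure 6 follows Neményi's formula CD = q_alpha * sqrt(k(k+1)/(6N)) for k strategies over N datasets; a direct transcription (the q_alpha values come from the studentized range distribution and are assumed supplied by the caller):

```python
import math

def nemenyi_cd(q_alpha, k, n_datasets):
    """Nemenyi critical difference for average ranks of k strategies on N datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))
```

Two strategies' average ranks are deemed significantly different if they differ by more than the CD.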
From all the above, the top five algorithms among the contenders were the Random Forest, SVC, Bagging, K Neighbours and Gradient Boosting classifiers.
Further benchmarking results may be found in the automatically generated Appendix A. These include results of paired t-tests and Wilcoxon signed-rank tests. Briefly summarizing these: neither the t-test (Appendix A.1) nor the Wilcoxon signed-rank test (Appendix A.2), with Bonferroni correction (adjacent strategies and all vs. baseline), is able, in isolation, to reject the null hypothesis of no performance difference between any two of the top five performers.
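A hedged sketch of one such pairwise comparison follows; the paired accuracies are synthetic, and the choice of m = 10 comparisons is an illustrative assumption, not the number used in the study.

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

# Hypothetical per-data-set accuracies of two strategies (paired by data set).
rng = np.random.default_rng(1)
acc_a = rng.uniform(0.70, 0.90, size=60)
acc_b = acc_a - 0.03 + rng.normal(0, 0.01, size=60)   # strategy B slightly worse

t_stat, p_t = ttest_rel(acc_a, acc_b)    # paired t-test
w_stat, p_w = wilcoxon(acc_a - acc_b)    # Wilcoxon signed-rank test

# Bonferroni correction: with m comparisons, test each at level alpha / m.
m, alpha = 10, 0.05
significant = p_t < alpha / m
```

The Bonferroni correction controls the family-wise error rate, at the cost of reduced power when many comparisons are made at once.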
5.6 Discussion
We discuss our findings below, including a comparison with the benchmarking study by Fernández-Delgado et al. [19].
5.6.1 Key findings
In summary, the key findings of the benchmarking study are:

MLaut is capable of carrying out large-scale benchmarking experiments across a representative selection of off-the-shelf supervised learning strategies, including state-of-the-art deep learning models, on a selection of small-to-moderate-sized basic supervised learning benchmark data sets.

On the selection of benchmark data sets representative of basic (non-specialized) supervised learning, the best performing algorithms are ensembles of trees and kernel-based algorithms. Neural networks (deep or not) perform poorly in comparison.

Of the algorithms benchmarked, grid-tuned support vector classifiers are the most demanding in terms of computation time. Neural networks (deep or not) and the other algorithms benchmarked require computation times of a comparable order of magnitude.
5.6.2 Limitations
The main limitations of our study are:

restriction to the Delgado data set collection. Our study is at most as representative of the methods' performance as the Delgado data set collection is of basic supervised learning.

training the neural networks for one epoch only. As described in Section 5.4.3, we believe we arrived at this choice by following standard tuning protocol, but it requires further investigation, in particular to rule out a mistake, or to corroborate evidence of a potential general issue of neural networks with basic supervised learning (i.e., not on image, audio, text data, etc.).

a relatively small set of prediction strategies. While our study is an initial proof-of-concept for MLaut on commonly used algorithms, it did not include composite strategies (e.g., full pipelines), or the full selection available in state-of-the-art packages.
5.6.3 Comparison to the study of Fernández-Delgado et al.
In comparison to the benchmarking study of Fernández-Delgado et al. [19], for most algorithms we find comparable performances, within our 95% confidence bands. A notable departure is the performance of the neural networks, which we find to be substantially worse. The latter finding may plausibly be explained by at least one of the following:

an overly optimistic bias of Fernández-Delgado et al. [19], through their mistake of tuning on the test set. This bias would be expected to be most severe for the models with the most degrees of freedom to tune, i.e., the neural networks.
As an additional comparison, the general rankings (disregarding the neural networks) are similar. However, since a replication of rankings would require conducting the study on exactly the same set of strategies, we are only able to state this qualitatively. Conversely, our confidence intervals indicate that rankings are in general very unstable on this data set collection, as roughly half of the 179 classifiers which Fernández-Delgado et al. [19] benchmarked appear to be within 95% confidence ranges of each other.
This seems to highlight the crucial necessity of reporting not only performances but also confidence bands, if reasoning is to be conducted about which algorithmic strategies are the “best” performing ones.
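Such confidence bands can be obtained, for instance, from a normal approximation for the mean accuracy across data sets. The sketch below is illustrative: the function name and the z = 1.96 constant (for a 95% band) are our choices, not necessarily those used in MLaut.

```python
import numpy as np

def mean_accuracy_ci(acc, z=1.96):
    """Normal-approximation confidence band for a strategy's mean accuracy,
    where acc holds one (test-set averaged) accuracy per data set."""
    acc = np.asarray(acc, dtype=float)
    se = acc.std(ddof=1) / np.sqrt(len(acc))   # standard error of the mean
    return acc.mean() - z * se, acc.mean() + z * se

# Hypothetical per-data-set accuracies of one strategy
lo, hi = mean_accuracy_ci([0.70, 0.80, 0.90, 0.75, 0.85])
```

Two strategies whose bands overlap cannot be distinguished at the corresponding confidence level, which is why point estimates alone can be misleading.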
5.6.4 Conclusions
Our findings corroborate most of the findings of the major existing benchmarking study of Fernández-Delgado et al. [19]. In addition, we validate the usefulness of MLaut for easily conducting such a study.
As a notable exception to this confirmation of results, we find that neural networks do not perform well on "basic" supervised classification data sets. While this may be explained by the bias that Fernández-Delgado et al. [19] introduced into their study by tuning on the test set, it comes with the strong caveat that further investigation is needed, in particular with respect to the tuning behaviour of said networks, and to rule out other mistakes in our own experiment.
However, if further investigation confirms our findings, they would be consistent with the findings of one of the original dropout papers [42], in which the authors also conclude that the improvements are more noticeable on image data sets and less so on other types of data such as text. For example, the authors found that the performance improvements achieved on the Reuters RCV1 corpus were not significant in comparison with architectures that did not use dropout. Furthermore, at least in our study, we found no evidence to suggest that deep architectures performed better than shallow ones; in fact, the 12-layer deep neural network architecture ranked only slightly better than our baseline classifier. Our findings may also suggest that wide architectures tend to perform better than thin ones on our training data. It should also be pointed out that the data sets we used in this experiment were relatively small in size. It could therefore be argued that deep neural networks can easily overfit such data, and that the default parameter choices and standard procedures are not appropriate, especially since such common practice may arguably be strongly adapted to image/audio/text data.
In terms of training time, the SVC algorithm proved to be the most expensive, taking on average almost 30 minutes to train in our setup. However, it should be noted that this is due to the relatively large hyperparameter search space that we used. On the other hand, among the top five algorithms, the Bagging Classifier was one of the least expensive to train, taking an average of only 5 seconds. Our top performer, the Random Forest Classifier, was also relatively inexpensive to train, taking an average of only 14 seconds.
As our main finding, however, we consider the ease with which a user may generate the above results using MLaut. The reader may (hopefully) convince themselves of this by inspecting the code and jupyter notebooks in the repository. We are also very appreciative of any criticism or suggestions for improvement, made (say, by an unconvinced reader) through the project's issue tracker.
References
 [1] scikit-learn laboratory. URL https://skll.readthedocs.io.
 Abadi et al. [2015] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Largescale machine learning on heterogeneous systems, 2015.
 Ba and Caruana [2013] Lei Jimmy Ba and Rich Caruana. Do Deep Nets Really Need to be Deep? arXiv:1312.6184 [cs], 2013.
 Bengio [2012] Yoshua Bengio. Practical recommendations for gradientbased training of deep architectures. arXiv:1206.5533 [cs], 2012.
 Bergstra and Bengio [2012] James Bergstra and Yoshua Bengio. Random Search for HyperParameter Optimization. Journal of Machine Learning Research, 2012.
 Bischl et al. [2016] Bernd Bischl, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M. Jones. mlr: Machine learning in R. Journal of Machine Learning Research, 17(170):1–5, 2016. URL http://www.jmlr.org/papers/v17/15-066.html.
 Bishop [2006] Christopher Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, 2006. ISBN 9781493938438.
 Breiman [1996] Leo Breiman. Bagging predictors. Machine Learning, 1996.
 Breiman [2001] Leo Breiman. Random Forests. Machine Learning, 2001.
 Buitinck et al. [2013] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake Vanderplas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. Api design for machine learning software: experiences from the scikitlearn project. CoRR, 2013.
 Chen et al. [2015] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. arXiv:1512.01274 [cs], 2015.
 Chollet [2015] François Chollet. Keras, 2015. URL https://keras.io.
 Cortes and Vapnik [1995] Corinna Cortes and Vladimir Vapnik. SupportVector Networks. Machine Learning, 1995.
 Cover and Hart [1967] Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 1967.
 Crammer et al. [2006] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. The Journal of Machine Learning Research, 2006.
 Demšar [2006] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 2006.
 Demšar et al. [2013] Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, Marko Toplak, Anže Starič, Miha Štajdohar, Lan Umek, Lan Žagar, Jure Žbontar, Marinka Žitnik, and Blaž Zupan. Orange: Data Mining Toolbox in Python. Journal of Machine Learning Research, 2013.
 Efron and Hastie [2016] Bradley Efron and Trevor Hastie. Computer Age Statistical Inference: Algorithms, Evidence and Data Science. Institute of Mathematical Statistics Monographs. Cambridge University Press, Cambridge, 2016.
 Fernández-Delgado et al. [2014] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? Journal of Machine Learning Research, 2014.
 Feurer et al. [2015] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, pages 2962–2970, 2015.
 Friedman [2001] Jerome Friedman. Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 2001.
 Goodfellow et al. [2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
 Gupta and Raza [2018] Tarun Kumar Gupta and Khalid Raza. Optimizing Deep Neural Network Architecture: A Tabu Search Based Approach. arXiv:1808.05979 [cs, stat], 2018.
 Hall et al. [2009] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. The weka data mining software: an update. ACM SIGKDD explorations newsletter, 11(1):10–18, 2009.
 Hasanpour et al. [2016] Seyyed Hossein Hasanpour, Mohammad Rouhani, Mohsen Fayyaz, and Mohammad Sabokrou. Lets keep it simple, Using simple architectures to outperform deeper and more complex architectures. arXiv:1608.06037 [cs], 2016.
 Hinton et al. [2012] Geoffrey Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing coadaptation of feature detectors. arXiv:1207.0580 [cs], 2012.
 Hsu et al. [2003] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A Practical Guide to Support Vector Classification. Technical report, 2003.
 Huang et al. [2003] Jin Huang, Jingjing Lu, and Charles Ling. Comparing naive Bayes, decision trees, and SVM with AUC and accuracy. IEEE Comput. Soc, 2003.
 Jagtap and Kodge [2013] Sudhir Jagtap and Bheemashankar Kodge. Census Data Mining and Data Analysis using WEKA. arXiv:1310.4647 [cs], 2013.
 James et al. [2013] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. Introduction to Statistical Learning. Springer Publishing Company, Incorporated, 2013. ISBN 9781461471370.
 Jia et al. [2014] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding, 2014.
 Kazakov and Király [2018] Viktor Kazakov and Franz Király. mlaut: Machine Learning automation toolbox, 2018. URL https://github.com/alan-turing-institute/mlaut.
 Krizhevsky et al. [2017] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 2017.

 Ojha et al. [2017] Varun Kumar Ojha, Ajith Abraham, and Václav Snášel. Metaheuristic Design of Feedforward Neural Networks: A Review of Two Decades of Research. Engineering Applications of Artificial Intelligence, 2017.
 Pedregosa et al. [2011] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 2011.
 Probst et al. [2018] Philipp Probst, Bernd Bischl, and AnneLaure Boulesteix. Tunability: Importance of Hyperparameters of Machine Learning Algorithms. arXiv:1802.09596 [stat], 2018.
 Ross [2010] Sheldon M Ross. Introductory Statistics  3rd Edition. Academic Press, 2010.
 Sansone and De Natale [2017] Emanuele Sansone and Francesco G. B. De Natale. Training Feedforward Neural Networks with Standard Logistic Activations is Feasible. arXiv:1710.01013 [cs, stat], 2017.
 Scikit-learn [2018] Scikit-learn. Model selection: choosing estimators and their parameters — scikit-learn 0.20.0 documentation, 2018.
 Seide and Agarwal [2016] Frank Seide and Amit Agarwal. CNTK: Microsoft's Open-Source Deep-Learning Toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 2135–2135, New York, NY, USA, 2016. ACM. ISBN 9781450342322. doi: 10.1145/2939672.2945397.
 Sonnenburg et al. [2010] Sören Sonnenburg, Gunnar Rätsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien, Fabio de Bona, Alexander Binder, Christian Gehl, and Vojtěch Franc. The SHOGUN Machine Learning Toolbox. Journal of Machine Learning Research, pages 1799–1802, 2010.
 Srivastava et al. [2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 2014.

 Terpilowski [2018] Maksim Terpilowski. scikit-posthocs: Statistical post-hoc analysis and outlier detection algorithms, 2018. URL http://github.com/maximtrp/scikit-posthocs.
 Thomas et al. [2018] Janek Thomas, Stefan Coors, and Bernd Bischl. Automatic gradient boosting. arXiv preprint arXiv:1807.03873, 2018.
 Wainer [2016] Jacques Wainer. Comparison of 14 different families of classification algorithms on 115 binary datasets. arXiv:1606.00930 [cs], 2016.
 Wilcoxon [1945] Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1945.
 Wing et al. [2018] Max Kuhn, with contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, Michael Benesty, Reynald Lescarbeau, Andrew Ziem, Luca Scrucca, Yuan Tang, Can Candan, and Tyler Hunt. caret: Classification and Regression Training, 2018.
Appendix A Further benchmarking results
A.1 Paired t-test, without multiple testing correction
  BaggingClassifier  BaselineClassifier  BernoulliNaiveBayes  GaussianNaiveBayes
  t_stat  p_val  t_stat  p_val  t_stat  p_val  t_stat  p_val
BaggingClassifier  0.000  1.000  16.779  0.000  5.456  0.000  6.137  0.000
BaselineClassifier  16.779  0.000  0.000  1.000  11.764  0.000  9.373  0.000
BernoulliNaiveBayes  5.456  0.000  11.764  0.000  0.000  1.000  1.366  0.173
GaussianNaiveBayes  6.137  0.000  9.373  0.000  1.366  0.173  0.000  1.000
GradientBoostingClassifier  1.407  0.161  14.703  0.000  3.719  0.000  4.610  0.000
K_Neighbours  0.734  0.463  16.373  0.000  4.837  0.000  5.600  0.000
NN12layer_wide_with_dropout  10.631  0.000  3.963  0.000  6.273  0.000  4.629  0.000
NN12layer_wide_with_dropout_lr01  13.058  0.000  2.045  0.042  8.562  0.000  6.687  0.000
NN12layer_wide_with_dropout_lr1  13.606  0.000  1.309  0.192  9.183  0.000  7.296  0.000
NN2layerdroputinputlayer_lr001  6.453  0.000  8.206  0.000  2.001  0.047  0.660  0.510
NN2layerdroputinputlayer_lr01  10.186  0.000  4.101  0.000  5.927  0.000  4.347  0.000
NN2layerdroputinputlayer_lr1  11.574  0.000  3.020  0.003  7.230  0.000  5.521  0.000
NN4layerdroputeachlayer_lr0001  5.912  0.000  8.451  0.000  1.557  0.121  0.278  0.781
NN4layerdroputeachlayer_lr01  12.754  0.000  2.109  0.036  8.336  0.000  6.514  0.000
NN4layerdroputeachlayer_lr1  12.640  0.000  2.350  0.020  8.177  0.000  6.346  0.000
NN4layer_thin_dropout  6.405  0.000  8.003  0.000  2.042  0.042  0.723  0.471
NN4layer_thin_dropout_lr01  12.112  0.000  2.299  0.022  7.839  0.000  6.117  0.000
NN4layer_thin_dropout_lr1  13.293  0.000  1.411  0.159  8.937  0.000  7.100  0.000
NN4layer_wide_no_dropout  4.958  0.000  9.704  0.000  0.477  0.634  0.742  0.459
NN4layer_wide_no_dropout_lr01  12.554  0.000  2.500  0.013  8.067  0.000  6.234  0.000
NN4layer_wide_no_dropout_lr1  12.618  0.000  2.316  0.021  8.174  0.000  6.351  0.000
NN4layer_wide_with_dropout  5.043  0.000  9.661  0.000  0.548  0.584  0.680  0.497
NN4layer_wide_with_dropout_lr01  13.170  0.000  1.892  0.060  8.690  0.000  6.813  0.000
NN4layer_wide_with_dropout_lr1  12.877  0.000  2.118  0.035  8.416  0.000  6.568  0.000
PassiveAggressiveClassifier  2.876  0.004  13.497  0.000  2.313  0.022  3.371  0.001
RandomForestClassifier  0.253  0.800  16.752  0.000  5.597  0.000  6.262  0.000
SVC  0.607  0.544  15.781  0.000  4.660  0.000  5.443  0.000

  GradientBoostingClassifier  K_Neighbours  NN12layer_wide_with_dropout  NN12layer_wide_with_dropout_lr01
  t_stat  p_val  t_stat  p_val  t_stat  p_val  t_stat  p_val
BaggingClassifier  1.407  0.161  0.734  0.463  10.631  0.000  13.058  0.000
BaselineClassifier  14.703  0.000  16.373  0.000  3.963  0.000  2.045  0.042
BernoulliNaiveBayes  3.719  0.000  4.837  0.000  6.273  0.000  8.562  0.000
GaussianNaiveBayes  4.610  0.000  5.600  0.000  4.629  0.000  6.687  0.000
GradientBoostingClassifier  0.000  1.000  0.749  0.454  9.091  0.000  11.374  0.000
K_Neighbours  0.749  0.454  0.000  1.000  10.190  0.000  12.634  0.000
NN12layer_wide_with_dropout  9.091  0.000  10.190  0.000  0.000  1.000  1.837  0.068
NN12layer_wide_with_dropout_lr01  11.374  0.000  12.634  0.000  1.837  0.068  0.000  1.000
NN12layer_wide_with_dropout_lr1  11.936  0.000  13.193  0.000  2.470  0.014  0.665  0.507
NN2layerdroputinputlayer_lr001  5.027  0.000  5.953  0.000  3.805  0.000  5.751  0.000
NN2layerdroputinputlayer_lr01  8.702  0.000  9.748  0.000  0.190  0.850  2.002  0.046
NN2layerdroputinputlayer_lr1  10.008  0.000  11.144  0.000  0.855  0.394  0.961  0.338
NN4layerdroputeachlayer_lr0001  4.541  0.000  5.415  0.000  4.099  0.000  6.021  0.000
NN4layerdroputeachlayer_lr01  11.115  0.000  12.332  0.000  1.734  0.084  0.084  0.933
NN4layerdroputeachlayer_lr1  10.985  0.000  12.214  0.000  1.540  0.125  0.295  0.768
NN4layer_thin_dropout  5.013  0.000  5.914  0.000  3.686  0.000  5.602  0.000
NN4layer_thin_dropout_lr01  10.559  0.000  11.693  0.000  1.474  0.142  0.310  0.757
NN4layer_thin_dropout_lr1  11.665  0.000  12.881  0.000  2.338  0.020  0.549  0.584
NN4layer_wide_no_dropout  3.581  0.000  4.436  0.000  5.141  0.000  7.126  0.000
NN4layer_wide_no_dropout_lr01  10.891  0.000  12.125  0.000  1.416  0.158  0.427  0.670
NN4layer_wide_no_dropout_lr1  10.972  0.000  12.193  0.000  1.560  0.120  0.269  0.788
NN4layer_wide_with_dropout  3.658  0.000  4.520  0.000  5.091  0.000  7.079  0.000
NN4layer_wide_with_dropout_lr01  11.489  0.000  12.748  0.000  1.968  0.050  0.138  0.891
NN4layer_wide_with_dropout_lr1  11.215  0.000  12.454  0.000  1.751  0.081  0.079  0.937
PassiveAggressiveClassifier  1.370  0.172  2.238  0.026  7.982  0.000  10.251  0.000
RandomForestClassifier  1.618  0.107  0.976  0.330  10.700  0.000  13.096  0.000
SVC  0.791  0.430  0.086  0.931  9.919  0.000  12.269  0.000

  NN12layer_wide_with_dropout_lr1  NN2layerdroputinputlayer_lr001  NN2layerdroputinputlayer_lr01  NN2layerdroputinputlayer_lr1
  t_stat  p_val  t_stat  p_val  t_stat  p_val  t_stat  p_val
BaggingClassifier  13.606  0.000  6.453  0.000  10.186  0.000  11.574  0.000
BaselineClassifier  1.309  0.192  8.206  0.000  4.101  0.000  3.020  0.003
BernoulliNaiveBayes  9.183  0.000  2.001  0.047  5.927  0.000  7.230  0.000
GaussianNaiveBayes  7.296  0.000  0.660  0.510  4.347  0.000  5.521  0.000
GradientBoostingClassifier  11.936  0.000  5.027  0.000  8.702  0.000  10.008  0.000
K_Neighbours  13.193  0.000  5.953  0.000  9.748  0.000  11.144  0.000
NN12layer_wide_with_dropout  2.470  0.014  3.805  0.000  0.190  0.850  0.855  0.394
NN12layer_wide_with_dropout_lr01  0.665  0.507  5.751  0.000  2.002  0.046  0.961  0.338
NN12layer_wide_with_dropout_lr1  0.000  1.000  6.349  0.000  2.624  0.009  1.602  0.111
NN2layerdroputinputlayer_lr001  6.349  0.000  0.000  1.000  3.552  0.000  4.663  0.000
NN2layerdroputinputlayer_lr01  2.624  0.009  3.552  0.000  0.000  1.000  1.031  0.304
NN2layerdroputinputlayer_lr1  1.602  0.111  4.663  0.000  1.031  0.304  0.000  1.000
NN4layerdroputeachlayer_lr0001  6.609  0.000  0.354  0.724  3.846  0.000  4.944  0.000
NN4layerdroputeachlayer_lr01  0.741  0.460  5.600  0.000  1.899  0.059  0.868  0.386
NN4layerdroputeachlayer_lr1  0.954  0.341  5.430  0.000  1.708  0.089  0.668  0.505
NN4layer_thin_dropout  6.194  0.000  0.070  0.945  3.439  0.001  4.533  0.000
NN4layer_thin_dropout_lr01  0.950  0.343  5.247  0.000  1.640  0.102  0.627  0.531
NN4layer_thin_dropout_lr1  0.109  0.914  6.174  0.000  2.493  0.013  1.479  0.141
NN4layer_wide_no_dropout  7.710  0.000  1.339  0.182  4.864  0.000  5.999  0.000
NN4layer_wide_no_dropout_lr01  1.087  0.278  5.319  0.000  1.587  0.114  0.542  0.588
NN4layer_wide_no_dropout_lr1  0.927  0.355  5.439  0.000  1.728  0.085  0.691  0.491
NN4layer_wide_with_dropout  7.664  0.000  1.281  0.201  4.815  0.000  5.951  0.000
NN4layer_wide_with_dropout_lr01  0.528  0.598  5.874  0.000  2.130  0.034  1.093  0.275
NN4layer_wide_with_dropout_lr1  0.740  0.460  5.643  0.000  1.916  0.057  0.879  0.380
PassiveAggressiveClassifier  10.834  0.000  3.867  0.000  7.614  0.000  8.909  0.000
RandomForestClassifier  13.640  0.000  6.571  0.000  10.261  0.000  11.632  0.000
SVC  12.823  0.000  5.808  0.000  9.503  0.000  10.848  0.000

  NN4layerdroputeachlayer_lr0001  NN4layerdroputeachlayer_lr01  NN4layerdroputeachlayer_lr1  NN4layer_thin_dropout
  t_stat  p_val  t_stat  p_val  t_stat  p_val  t_stat  p_val
BaggingClassifier  5.912  0.000  12.754  0.000  12.640  0.000  6.405  0.000
BaselineClassifier  8.451  0.000  2.109  0.036  2.350  0.020  8.003  0.000
BernoulliNaiveBayes  1.557  0.121  8.336  0.000  8.177  0.000  2.042  0.042
GaussianNaiveBayes  0.278  0.781  6.514  0.000  6.346  0.000  0.723  0.471
GradientBoostingClassifier  4.541  0.000  11.115  0.000  10.985  0.000  5.013  0.000
K_Neighbours  5.415  0.000  12.332  0.000  12.214  0.000  5.914  0.000
NN12layer_wide_with_dropout  4.099  0.000  1.734  0.084  1.540  0.125  3.686  0.000
NN12layer_wide_with_dropout_lr01  6.021  0.000  0.084  0.933  0.295  0.768  5.602  0.000
NN12layer_wide_with_dropout_lr1  6.609  0.000  0.741  0.460  0.954  0.341  6.194  0.000
NN2layerdroputinputlayer_lr001  0.354  0.724  5.600  0.000  5.430  0.000  0.070  0.945
NN2layerdroputinputlayer_lr01  3.846  0.000