AMLB: an AutoML Benchmark

Comparing different AutoML frameworks is notoriously challenging and often done incorrectly. We introduce an open and extensible benchmark that follows best practices and avoids common mistakes when comparing AutoML frameworks. We conduct a thorough comparison of 9 well-known AutoML frameworks across 71 classification and 33 regression tasks. The differences between the AutoML frameworks are explored with a multi-faceted analysis, evaluating model accuracy, its trade-offs with inference time, and framework failures. We also use Bradley-Terry trees to discover subsets of tasks where the relative AutoML framework rankings differ. The benchmark comes with an open-source tool that integrates with many AutoML frameworks and automates the empirical evaluation process end-to-end: from framework installation and resource allocation to in-depth evaluation. The benchmark uses public data sets, can be easily extended with other AutoML frameworks and tasks, and has a website with up-to-date results.


1 Introduction

To create useful machine learning (ML) models from data, data scientists must prepare the data (for example, by encoding categorical features and processing text data), select an ML algorithm, and tune its hyperparameters. This requires extensive expertise, such as knowing which hyperparameters to tune and how (Probst et al., 2019; Weerts et al., 2020). Even with this knowledge, it is a time-consuming task, since the best choices are unique to each data set and can be interdependent (Van Rijn and Hutter, 2018).

The field of automated machine learning (AutoML) is focused on automating the design and optimization of ML pipelines in a data-driven way (Hutter et al., 2019). Neural Architecture Search (NAS) is an important part of AutoML that automates the design decisions of deep neural networks. AutoML aims to free up valuable time for experts to perform other tasks and allow novice users to train well-performing ML models.

Many different AutoML approaches have been proposed, including sequential model-based optimization (Hutter et al., 2011; Snoek et al., 2012), hierarchical task planning (Erol et al., 1994), and genetic programming (Koza and Koza, 1992). Novel systems are being developed in both academia and industry, and a recent survey by Van der Blom et al. (2021) showed that many practitioners (at least partially) adopt automated model selection and hyperparameter configuration.

1.1 The Need for Standardized Benchmarks

With considerable effort being spent on developing and improving AutoML frameworks, as well as increased usage by practitioners, there is a need for systematic and in-depth comparisons of the various approaches to track progress in the field. However, the comparison of AutoML frameworks is prone to several types of error. First, selection bias may be introduced, even accidentally, when authors decide which data sets to use in their evaluation. For example, too few data sets may be selected to accurately evaluate a framework’s strengths and weaknesses, and the chosen data sets may no longer be challenging for current AutoML frameworks. Without a standard suite of data sets to use for evaluation, the selection of data sets is often neither justified nor well motivated. Moreover, issues may arise from errors in the installation, configuration, or use of ‘competitor’ frameworks. Typical examples are misunderstanding memory management and/or using insufficient compute resources (Balaji and Allen, 2018), or failing to use comparable resource budgets (Ferreira et al., 2021).

Several suites of benchmark data sets have been used in the aforementioned papers, but none have become a standard in the AutoML community. The original selection of data sets by Thornton et al. (2013) was used in several earlier papers (Feurer et al., 2015a; Mohr et al., 2018), but fails to highlight differences between current AutoML frameworks. Most published AutoML papers use a self-selected suite of data sets on which their methods are evaluated. For example, Drori et al. (2018), Rakotoarison et al. (2019), and Gil et al. (2018) were all published around the same time but feature different suites on which they evaluate their contributions. This inconsistency makes it impossible to directly compare results across papers and track progress over time. It also potentially allows for presenting cherry-picked results.

1.2 Our Contributions

We introduce a novel AutoML benchmark following best practices to avoid these common pitfalls while stimulating progress towards more standardized benchmarking. (A first look at this tool was presented at the ICML 2019 AutoML Workshop (Gijsbers et al., 2019); the current version is significantly more general, more systematic, and allows much more in-depth analysis.) To ensure reproducibility (using the ACM definition: an independent group can obtain the same result using the authors’ own artifacts, https://www.acm.org/publications/policies/artifact-review-and-badging-current), we provide an open-source benchmarking tool (code, results, and documentation at https://openml.github.io/automlbenchmark/) that allows easy integration with AutoML frameworks and performs end-to-end evaluations thereof on carefully curated sets of open data sets. Our focus is on tabular data. Unstructured data are out of scope for this paper, since they are best tackled with NAS, and benchmarking NAS frameworks imposes different practical constraints, as discussed in Section 2.2.

Our benchmarking tool can be used to perform evaluations of AutoML frameworks in a fully automated way. The AutoML frameworks are integrated with the benchmarking tool in direct collaboration with their developers to ensure correctness. We carry out a large-scale evaluation of 9 well-known AutoML frameworks across 71 classification and 33 regression tasks. To better understand how these systems perform across many tasks, we also introduce techniques for detailed comparison of AutoML frameworks, including final model accuracy, inference-time trade-offs, and failure analysis. Finally, we provide an interactive visualization tool that may be used to further explore all our results or to reproduce the analyses performed in this paper.

In the remainder of this paper, we first discuss related work in Section 2 and cover several key open-source AutoML frameworks in Section 3. Next, we provide an overview of our proposed benchmarking tool in Section 4 and motivate our benchmark design choices in Section 5. The results obtained by running this benchmark are analyzed in Section 6. Finally, Section 7 summarizes our main conclusions and sketches directions for future work.

2 Related Literature

In this section, we motivate why we need benchmarks specifically designed for AutoML, review other work evaluating AutoML frameworks, and finally discuss the relevant ML benchmarking literature. Several benchmark suites have been developed in ML (Van Gestel et al., 2004; Olson et al., 2017; Wu et al., 2018; Bischl et al., 2021). The data sets in these suites often do not include problematic data characteristics found in real-world tasks (for example, many missing values), as many ML algorithms are not natively able to handle them. By contrast, in order to be applicable to a wide range of data, AutoML frameworks should be designed to handle these problematic data sets. As such, there is an opportunity to allow more such problematic data sets in ML benchmarks and to study how well AutoML frameworks handle these issues. Moreover, runtime budgets are crucial in an AutoML benchmark, as most AutoML frameworks are designed to run until a given time budget is exhausted. However, these runtime budgets are often not specified beforehand in traditional ML benchmarks, since algorithms usually must run to completion (one exception is performance studies, such as Kotthaus et al., 2015). Consequently, new benchmarks designed specifically for AutoML frameworks are needed.

2.1 Evaluation of Automated Machine Learning

To establish new best practices for AutoML benchmarking, it is beneficial to study the shortcomings encountered in prior benchmarks as well as lessons learned. Balaji and Allen (2018) conducted one of the first benchmark studies on AutoML frameworks. They evaluated four open-source frameworks on both classification and regression tasks sourced from OpenML (Vanschoren et al., 2014), optimized for weighted F1 score and mean squared error, respectively. Unfortunately, they encountered technical issues with most AutoML frameworks that led to a questionable experimental evaluation. For example, H2O AUTOML (LeDell and Poirier, 2020) was configured to optimize to a different metric (log loss as opposed to weighted F1 score) and ran with a different setup (unlike the others, H2O AUTOML was not containerized), and AUTO_ML (Parry, 2018) had its hyperparameter optimization (HPO) disabled, making for incomparable results. This highlights the need for careful configuration of all AutoML frameworks involved.

A study on nearly 300 data sets across six different frameworks was conducted by Truong et al. (2019). Each experiment consisted of a single 80/20 holdout split with a 15-minute training time budget, which was chosen so that most tools returned a result on at least 70% of the data sets. We postulate that the data sets for which a framework returns no result are most often those for which optimization is hard. For example, a large data set might cause one framework to conduct only a few evaluations while it completely halts another. Unfortunately, this makes the resulting comparisons uninterpretable, as a framework could seemingly demonstrate better performance simply because it failed to return models on data sets for which optimization was difficult. Hence, it is key that failures are avoided as much as possible, and any remaining failures should be analysed and taken into account in subsequent analysis.

On the positive side, Truong et al. (2019) present their results across different subsets of the benchmark—for example, few versus many categorical features—which helps to highlight differences between frameworks. The authors also conduct small-scale experiments to analyze performance over time, by running the tools with multiple time budgets on a subset of data sets, as well as ‘robustness’, which denotes the variance in final performance given the same input data. Unfortunately, both experiments were conducted on only one data set per sub-category, which does not lend itself to generalizing the results. Still, studying the impact of data characteristics and budget sizes should ideally be part of AutoML benchmark design.

Zöller and Huber (2021) present a survey on AutoML and combined algorithm selection and HPO (CASH, Thornton et al., 2013) frameworks. Six CASH and five AutoML frameworks are compared across 137 classification tasks, with the former limited to 325 iterations and the latter constrained to a one-hour time budget. Among the CASH frameworks, hyperopt (Bergstra et al., 2013) performed best, although the absolute differences between all optimizers were small. The AutoML frameworks are compared as-is, which might reflect common use. However, by not controlling their settings, it becomes impossible to draw conclusions about the effectiveness of individual parts of AutoML systems. A number of AutoML framework errors are observed, including memory constraint violations, segmentation faults, and Java server crashes. The authors also find that most frameworks construct rather modest pipelines (with few preprocessing operators). As such, it is recommended that AutoML frameworks are controlled carefully, use a wide operator search space, and are evaluated on data sets that require non-trivial preprocessing.

Kaggle (https://www.kaggle.com/), a platform for data science competitions, is sometimes used to compare AutoML frameworks to human data scientists (Zöller and Huber, 2021; Erickson et al., 2020). Zöller and Huber (2021) found that the rankings of AutoML frameworks on benchmark data sets are very different from their rankings on competitions. Furthermore, humans still find better solutions than the examined AutoML frameworks. However, such results are hard to interpret. Submissions on Kaggle leaderboards range from serious attempts to random test runs, and a significant portion of them do not outperform a simple baseline. Finally, most Kaggle results are several years old and may no longer represent state-of-the-art human-made models because of advances in available hardware and algorithms.

A recent AutoML benchmark for multi-label classification (Wever et al., 2021) proposes a general tool with a configurable search space and optimizer, which allows for the inclusion of new methods and ablation studies. Unfortunately, this approach requires that existing AutoML frameworks be re-implemented within the tool, which is difficult in such a rapidly developing field. As such, we need a simpler way to include existing and new AutoML frameworks in benchmarking tools while still allowing control over their configuration.

In addition to AutoML benchmarks, a series of AutoML competitions has been hosted (Guyon et al., 2019). The first two competitions focused on tabular AutoML, where data are assumed to be independent and identically distributed. In these competitions, participants submitted code that automatically builds a model on given data and produces predictions for a test set. During the development phase, competitors could make use of a public leaderboard and several validation data sets. After the development phase, the latest submission of each participant was evaluated on a set of new data sets to determine the final ranking. These data sets consisted of a mix of new data and data taken from public repositories, although the latter were reformatted to conceal their identity. In their analysis, Guyon et al. (2019) reveal that most methods failed to return results on at least some data sets due to practical issues (such as running out of memory).

2.2 Other (Benchmark) Literature

Instead of benchmarking whole (Auto-)ML frameworks, it is often useful to focus on their sub-parts and optimize these step by step. One such part is HPO, and the available HPO algorithms are as numerous and diverse as the learning algorithms themselves. Consequently, various benchmarking suites exist to foster research by comparing these black-box optimization algorithms.

Like non-black-box optimizers, black-box optimizers are typically evaluated on synthetic test functions or on real-world tasks. A recent benchmarking suite for continuous optimization is COCO (Hansen et al., 2021), which includes a collection of synthetic black-box benchmark functions. kurobako (Ohta and Yamazaki, 2022) provides various general black-box optimizers and benchmark problems, while LassoBench (Šehić et al., 2021) is suitable for benchmarking high-dimensional optimization problems. Bayesmark (Turner, 2022), aimed especially at Bayesian optimization, combines several benchmarks on real-world tasks.

One of the first benchmarks for empirically evaluating HPO algorithms was HPOlib (Eggensperger et al., 2013), which allows accessing real-world HPO tasks, tabular and surrogate benchmarks, and synthetic test functions using a common API. This benchmark was used by Bergstra et al. (2014) in their empirical benchmark studies. HPOBench (Eggensperger et al., 2021) is a similar successor of HPOlib that concentrates on reproducible containerized benchmarks and multi-fidelity optimization problems. Based on OpenML (Vanschoren et al., 2014), Arango et al. (2021) recently introduced HPO-B, a large-scale reproducible benchmark for transfer-HPO methods. In contrast, PROFET (Klein et al., 2019) uses a generative meta-model to generate synthetic but realistic benchmark instances.

A related benchmark is NAS-Bench-101 (Ying et al., 2019), a tabular data set that maps convolutional neural network architectures to their trained and evaluated performance on CIFAR-10. Its goal is to make NAS research more accessible despite the tremendous demand for computational resources. It has since been followed by a series of NAS benchmarks, including NAS-Bench-1shot1 (Zela et al., 2020), NAS-Bench-301 (Siems et al., 2020), and others.

3 AutoML Tools

Automated ML pipeline design was first explored by Escalante et al. (2009), but the first prominent AutoML framework was AUTO-WEKA (Thornton et al., 2013). AUTO-WEKA used Bayesian optimization to select and tune the algorithms in an ML pipeline based on WEKA (Hall et al., 2009). Since then, a plethora of new AutoML frameworks have been developed, either by iteratively improving on old designs or using novel approaches. In this section, we will discuss the tools we considered for our benchmark.

Unfortunately, the cost of evaluating all frameworks is prohibitive, so we selected 9 of them to evaluate in this work. Only open-source tools were considered. From those, we made a selection that covers a variety of approaches, considering frameworks developed in both industry and academia, and included packages whose authors proactively integrated their AutoML framework.

The most notable omission is AUTO-WEKA, which we decided to exclude based on its performance in our 2019 evaluation and its lack of updates since (Gijsbers et al., 2019). Some frameworks integrated with the benchmark tool are not evaluated in this paper. The authors of AUTOXGBOOST (Thomas et al., 2018) opted out of an evaluation because it is built on deprecated software with no plans for updates. ML-PLAN (Mohr et al., 2018), MLR3AUTOML (source and documentation at https://github.com/a-hanf/mlr3automl/), and MLNET (source and documentation at https://github.com/dotnet/machinelearning/) are excluded because we encountered significant technical problems when evaluating these systems. We hope to publish results for these frameworks at a later date. There are still many tools not yet integrated with the AutoML benchmark, including OBOE (Yang et al., 2018), AUTO-KERAS (Jin et al., 2019), and AUTOPYTORCH (Zimmer et al., 2021), among others.

3.1 Integrated Frameworks

Table 1 offers an overview of the AutoML tools evaluated in this paper, alongside a simplified description of their AutoML designs, which are elaborated on below. We refer the interested reader to the original publications and to our website, which links to each tool’s source code and documentation.

Framework | Optimization | Search Space | Reference
AUTOGLUON | custom | predefined pipelines | Erickson et al. (2020)
AUTO-SKLEARN | Bayesian | SCIKIT-LEARN pipelines | Feurer et al. (2015a)
AUTO-SKLEARN 2 | Bayesian | iterative algorithms | Feurer et al. (2020)
FLAML | CFO | iterative algorithms | Wang et al. (2021)
GAMA | Evolution | SCIKIT-LEARN pipelines | Gijsbers and Vanschoren (2021)
H2O AUTOML | Random Search | H2O pipelines | LeDell and Poirier (2020)
LIGHTAUTOML | Bayesian | Linear model, GBM | Vakhrushev et al. (2021)
MLJAR | custom | python modules | Płońska and Płoński (2021)
TPOT | Evolution | SCIKIT-LEARN pipelines | Olson and Moore (2016)
Table 1: AutoML frameworks used in the experiments.

3.1.1 AutoGluon-Tabular

AUTOGLUON automates ML across a variety of tasks, including image, text, and tabular data. The subsystem that automates ML on tabular data is called AUTOGLUON-TABULAR (Erickson et al., 2020), but we will refer to it as AUTOGLUON. In contrast to other AutoML systems discussed here, AUTOGLUON does not perform a pipeline search or hyperparameter tuning. Instead, it has a predetermined set of models that are combined through multi-layer stacking and ensembling.

AUTOGLUON’s ensemble consists of three layers. The first layer consists of models from a range of model families trained directly on the data. In the second layer, the same types of models are considered, but now as stacking learners trained on both the input data and the predictions of the first-layer models. In the final layer, the predictions of the second-layer models are combined through ensemble selection (Caruana et al., 2004), an ensemble method first used in AutoML by AUTO-SKLEARN (Feurer et al., 2015a).

To adhere to time constraints, AUTOGLUON may stop iterative algorithms prematurely or forgo training certain models altogether. Given more time, AUTOGLUON will train additional models using the same algorithms and hyperparameter configurations on different data splits, which further improves the generalization of the stacking layer.
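
The layered setup can be approximated with standard SCIKIT-LEARN components. The sketch below is our own illustration of stacking with access to the original features, not AUTOGLUON’s actual implementation, which additionally uses bagged folds, more model families, time-aware training, and a third ensembling layer.

  # Illustrative sketch of stacking; not AUTOGLUON's implementation.
  from sklearn.ensemble import RandomForestClassifier, StackingClassifier
  from sklearn.linear_model import LogisticRegression
  from sklearn.neighbors import KNeighborsClassifier

  # Layer 1: diverse base models trained on the raw features.
  layer_one = [
      ("rf", RandomForestClassifier(n_estimators=100)),
      ("knn", KNeighborsClassifier()),
  ]

  # Layer 2: a learner trained on the original features plus the
  # cross-validated predictions of layer 1 (passthrough=True).
  stacked = StackingClassifier(
      estimators=layer_one,
      final_estimator=LogisticRegression(max_iter=1000),
      passthrough=True,
      cv=5,
  )
  # stacked.fit(X_train, y_train); stacked.predict(X_test)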

3.1.2 auto-sklearn

Based on the design of AUTO-WEKA, AUTO-SKLEARN (Feurer et al., 2015a) also uses Bayesian optimization but is instead implemented in Python and optimizes pipelines built with SCIKIT-LEARN (Pedregosa et al., 2011). Additionally, it warm-starts optimization through meta-learning, starting pipeline search with the best pipelines for the most similar data sets (Feurer et al., 2015b). After pipeline search has concluded, an ensemble is created from pipelines trained during search using the procedure described by Caruana et al. (2004, 2006). AUTO-SKLEARN has won two AutoML challenges (Guyon et al., 2019), although for both entries, AUTO-SKLEARN was customized for the competition, and not all changes are found in the public releases (Feurer et al., 2018).
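
Ensemble selection (Caruana et al., 2004) itself is a simple greedy procedure. The sketch below shows the core idea, assuming out-of-fold class probabilities for each candidate model and the matching true labels are available; it is not AUTO-SKLEARN’s actual code.

  import numpy as np
  from sklearn.metrics import log_loss

  def greedy_ensemble_selection(val_predictions, y_val, n_iterations=50):
      """Greedily add models (with replacement) whose inclusion most improves
      the validation log loss of the averaged ensemble prediction.

      val_predictions: list of arrays of shape (n_samples, n_classes),
                       one per candidate model (out-of-fold probabilities).
      Returns the relative selection counts, i.e., the ensemble weights."""
      counts = np.zeros(len(val_predictions), dtype=int)
      ensemble_sum = np.zeros_like(val_predictions[0])

      for _ in range(n_iterations):
          best_loss, best_idx = np.inf, None
          for idx, preds in enumerate(val_predictions):
              candidate = (ensemble_sum + preds) / (counts.sum() + 1)
              loss = log_loss(y_val, candidate)
              if loss < best_loss:
                  best_loss, best_idx = loss, idx
          counts[best_idx] += 1
          ensemble_sum += val_predictions[best_idx]

      return counts / counts.sum()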

Based on experience from the challenges, AUTO-SKLEARN 2 was subsequently developed (Feurer et al., 2020). The most notable changes include reducing the search space to only iterative learning algorithms and excluding most preprocessing, the use of successive halving (Jamieson and Talwalkar, 2016), adaptive evaluation strategies, and the replacement of the data-specific warm-start strategy with a data-agnostic portfolio of pipelines. Because these changes make version 2.0 almost entirely different from 1.0, and 1.0 has been updated since our last evaluation, we evaluate both AUTO-SKLEARN versions in this paper. However, AUTO-SKLEARN 2 does not yet support regression, and its heavy use of meta-learning made it impossible for us to perform a ‘clean’ evaluation at this time (see Section 5.3.3).

3.1.3 Flaml

The Fast and Lightweight AutoML Library (FLAML; Wang et al., 2021) optimizes boosting frameworks (XGBOOST, Chen and Guestrin, 2016; CATBOOST, Prokhorenkova et al., 2018; and LIGHTGBM, Ke et al., 2017) and a small selection of SCIKIT-LEARN algorithms through a multi-fidelity randomized directed search (Wu et al., 2021). This search is based on an expected cost for improvement, which tracks, for each learner, the expected computational cost of improving over the best model found so far. Only after choosing which learner to tune does hyperparameter optimization proceed, using a randomized directed search that samples a new configuration from a unit sphere around the previous sample point. After evaluating its validation performance, the next sample point is moved in that direction (if better) or in the opposite direction (if worse). FLAML positions itself as a fast AutoML framework designed to work with small time budgets (Wang et al., 2021).
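
The randomized directed search can be sketched in a few lines. The function below is a simplification of the idea in Wu et al. (2021), assuming hyperparameters normalized to the unit cube and ignoring FLAML’s cost model, step-size adaptation, and discrete hyperparameters.

  import numpy as np

  def randomized_directed_search(evaluate, x0, n_steps=100, step_size=0.1):
      """Minimize a validation-loss function over a normalized continuous
      hyperparameter vector. Simplified sketch, not FLAML's implementation."""
      x, best_loss = np.asarray(x0, dtype=float), evaluate(x0)
      for _ in range(n_steps):
          direction = np.random.normal(size=x.shape)
          direction /= np.linalg.norm(direction)         # point on the unit sphere
          candidate = np.clip(x + step_size * direction, 0.0, 1.0)
          loss = evaluate(candidate)
          if loss < best_loss:                            # better: move there
              x, best_loss = candidate, loss
          else:                                           # worse: try the opposite direction
              opposite = np.clip(x - step_size * direction, 0.0, 1.0)
              opp_loss = evaluate(opposite)
              if opp_loss < best_loss:
                  x, best_loss = opposite, opp_loss
      return x, best_loss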

3.1.4 Gama

Designed as a modular AutoML tool for researchers, GAMA’s search method and post-processing are easily configurable (Gijsbers and Vanschoren, 2021). By default, GAMA uses genetic programming to optimize linear ML pipelines with an arbitrary number of preprocessing steps. Similar to TPOT, GAMA’s evolutionary algorithm uses NSGA-II (Deb et al., 2002) to perform multi-objective optimization, maximizing performance while minimizing the number of components in the pipeline. In contrast to TPOT, GAMA’s evolutionary algorithm is asynchronous and does not work with distinct generations, which allows for higher resource utilization. The final model is constructed through ensemble selection (Caruana et al., 2004, 2006), similar to AUTO-SKLEARN.

3.1.5 H2O AutoML

Built on the H2O ML platform (H2O.ai, 2013), H2O AUTOML (LeDell and Poirier, 2020) evaluates a portfolio of algorithm configurations and also performs a random search over the majority of the supervised learning algorithms offered in H2O. To maximize model performance, H2O AUTOML also trains two types of stacked ensemble models at various stages during the run: an ensemble using all models available at that point, and an ensemble using only the best model of each algorithm type at that point. H2O AUTOML aims to cover a large search space quickly and relies on stacking to boost model performance. Additionally, H2O AUTOML uses a predefined strategy for imputation, normalization, and categorical encoding for each algorithm and does not currently optimize over preprocessing pipelines. The H2O AUTOML algorithm is designed to generate models that are very fast at inference time, rather than strictly focusing on maximizing model accuracy, with the goal of balancing these two competing factors to produce practical models suited for production environments.

3.1.6 LightAutoML

LIGHTAUTOML is specifically designed with applications in the financial services industry in mind (Vakhrushev et al., 2021). In this framework, pipelines are designed for quick inference and interpretability. Only linear models and boosting frameworks (CATBOOST and LIGHTGBM) are considered. Expert rules are used to evaluate likely good hyperparameter configurations and to design the search space. Tree-structured Parzen Estimators (Bergstra et al., 2011) are used to optimize the hyperparameters of the boosting frameworks, while warm-starting and early stopping are used to optimize linear models with grid search. Different models are combined in either a weighted voting ensemble (binary classification and regression) or with two levels of stacking (multi-class classification). In a special “compete” mode for larger time budgets, the AutoML pipeline is run multiple times with different configurations, and the resulting models are ensembled with weighted voting, which allows for a more robust model.

3.1.7 MLJar

MLJAR (Płońska and Płoński, 2021) starts its search with a set of predetermined models and a limited random search, similar to H2O AUTOML. This is followed by a feature creation and selection step, after which a hill-climbing algorithm is used to further tune the best pipelines. After the search, the models can be stacked, combined in a voting ensemble, or both. Learners from SCIKIT-LEARN are considered, as well as the boosting packages XGBOOST, CATBOOST, and LIGHTGBM. MLJAR features multiple modes for different use cases, including exploratory data analysis and finding a fast model, but also a ‘compete’ mode, which aims to find the best possible model and is used in our experiments.

3.1.8 Tpot

The Tree-based Pipeline Optimization Tool (Olson and Moore, 2016), or TPOT, optimizes pipelines using genetic programming. Using a grammar, ML pipelines can be expressed as trees where different branches represent distinct preprocessing pipelines. These pipelines are then optimized through evolutionary optimization. To reduce the overfitting that may arise from the large search space, multi-objective optimization is used to minimize pipeline complexity while optimizing performance (Olson et al., 2016). It is also possible to reduce the search space by specifying a pipeline template (Le et al., 2018), which dictates the high-level steps in the pipeline (for example, a “Selector-Transformer-Classifier” template results in pipelines with only those three steps, in that order). Development has focused on genomic studies, providing specific options for dealing with this type of high-dimensional data, for which prior knowledge may be present (Sohn et al., 2017). While TPOT supports neural networks in its search (Romano et al., 2021), the default search space uses SCIKIT-LEARN components and XGBOOST only.

3.2 Baselines

In addition to the integrated frameworks, the benchmark tool allows for running several baselines. The constant predictor always predicts the class prior or the mean target value, regardless of the values of the independent variables. The Random Forest baseline builds a forest 10 trees at a time, until one of two criteria is met: building 10 more trees is expected to exceed the memory or time limit, or a predefined maximum number of trees has been built.
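
A minimal sketch of such an incrementally grown forest, using SCIKIT-LEARN’s warm_start mechanism and a simple time check; the batch size, memory check, and exact thresholds used by the actual baseline are configured in the benchmark tool, so the values below are illustrative.

  import time
  from sklearn.ensemble import RandomForestClassifier

  def fit_forest_incrementally(X, y, time_budget_s, batch_size=10, max_trees=2000):
      """Grow a random forest in batches until the time budget or tree cap is hit.
      Memory checks are omitted in this sketch; max_trees is illustrative."""
      forest = RandomForestClassifier(n_estimators=0, warm_start=True)
      start = time.time()
      while forest.n_estimators < max_trees:
          batch_start = time.time()
          forest.n_estimators += batch_size
          forest.fit(X, y)  # warm_start=True only fits the newly added trees
          # Stop if another batch of the same duration would exceed the budget.
          batch_duration = time.time() - batch_start
          if time.time() - start + batch_duration > time_budget_s:
              break
      return forest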

The Tuned Random Forest baseline improves on the Random Forest baseline by using an optimized max_features value. The max_features hyperparameter defines how many features are considered when determining the best split, and it has been found to be the most important hyperparameter (Van Rijn and Hutter, 2018). (The hyperparameter min. samples leaf is more important, but not significantly so. Although it may not be immediately obvious, the absolute values used for min. samples leaf also transfer to our data sets, as do the relative values used for max_features.) The value is optimized by evaluating a set of unique values for the hyperparameter with cross-validation before training a final model with the best value found. The Tuned Random Forest is our strongest baseline and aims to mimic a typical first modeling approach by a human.
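
The max_features tuning can be sketched as follows; the candidate grid and fold count below are illustrative assumptions, as the actual values used by the baseline are defined in the benchmark tool.

  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import cross_val_score

  def tuned_random_forest(X, y, candidate_fractions=(0.1, 0.25, 0.5, 0.75, 1.0), cv=5):
      """Pick max_features by cross-validation, then refit on all data."""
      scores = {
          frac: cross_val_score(
              RandomForestClassifier(max_features=frac), X, y, cv=cv
          ).mean()
          for frac in candidate_fractions
      }
      best = max(scores, key=scores.get)
      return RandomForestClassifier(max_features=best).fit(X, y)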

Recently, Mohr and Wever (2021) proposed a baseline that aims to emulate the optimization a human expert might perform, including feature scaling, feature selection, hyperparameter tuning, and model selection. We omit this baseline here because it does not support regression and was published late into our preparation for the experiments.

4 Software

We developed an open-source benchmark tool that may be used for reproducible AutoML benchmarking (source code and documentation are available under the MIT license at https://github.com/openml/automlbenchmark). This tool features robust, automated experiment execution and supports multiple AutoML frameworks, many of which are evaluated in this paper. The benchmark tool is implemented as a Python application consisting mainly of an amlb module and a framework folder hosting all the officially supported extensions, which have been developed together with the AutoML framework developers. The main consideration in the design of the benchmark tool is to produce correct and reproducible evaluations: the AutoML tools are used as intended by their authors with little to no room for user error, and the same evaluation conditions (including framework version, data set, and resampling splits) and controlled computational environments can easily be recreated by anyone. The amlb module provides the following features:

  • a data loader to retrieve and prepare data from OpenML or local data sets.

  • various benchmark runner implementations:

    • a local runner that runs the experiments directly on the machine. This is also the runner to which each runner below delegates the final execution.

    • container runners (docker and singularity are currently supported), which allow preinstalling the amlb application together with a full setup of one framework and consistently run all benchmark tasks against the same setup. This implementation also makes it possible to run multiple container instances in parallel.

    • an aws runner that allows the user to safely run the benchmark on several EC2 instances in parallel. Each EC2 instance can itself use a pre-built docker image, as used for this paper, or can configure the target framework on the fly, which is useful for experiments in a development environment.

  • a job executor responsible for running and orchestrating all the tasks. When used with the aws runner, this allows distribution of the benchmark tasks across hundreds of EC2 instances in parallel, with each one being monitored remotely by the host.

  • a post-processor responsible for collecting and formatting the predictions returned by the frameworks, handling errors, and computing the scoring metrics before writing the information needed for post-analysis to a file.

Figure 13 in Appendix C provides an architecture overview and description of the flow of the benchmarking tool, as used in the experiments for this paper.

4.1 Extensible Framework Structure

To ensure that the benchmark tool is easily extensible to new AutoML frameworks, we integrate each tool through a minimal interface. Each of the current tools requires less than 200 lines of code across at most four files (most of which is boilerplate). The integration code handles installation of the AutoML framework and its software stack, provides the framework with data, and records its predictions. The integration requirements are minimal, as both input data and predictions can be exchanged as Python objects or in common file formats, which makes integration across programming languages possible (currently integrated frameworks are written in C, Java, Python, and R). By keeping the integration requirements minimal, we hope to encourage AutoML framework authors to contribute integration scripts for their frameworks while, at the same time, avoiding any influence on the methods or software used to design and develop new AutoML frameworks (as opposed to providing a generic starter kit, which may bias the developed AutoML frameworks, Guyon et al., 2019). Frameworks may also be integrated completely locally to allow for private benchmarking. (For information on how to add a framework, see https://github.com/openml/automlbenchmark/blob/master/docs/HOWTO.md#add-an-automl-framework.)
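
To illustrate the shape of such an integration: it essentially reduces to a single entry point that receives prepared data and constraints, runs the framework, and returns predictions. The sketch below is hypothetical; the attribute names on dataset and config are illustrative stand-ins rather than the tool’s exact interface (documented at the link above), and a plain SCIKIT-LEARN learner stands in for the AutoML framework being integrated.

  # Hypothetical integration sketch; see the repository documentation for the
  # actual interface and file layout.
  from sklearn.ensemble import HistGradientBoostingClassifier

  def run(dataset, config):
      X_train, y_train = dataset.train.X, dataset.train.y
      X_test = dataset.test.X

      # A real integration would pass the time budget, core count, and target
      # metric from `config` to the AutoML framework here.
      model = HistGradientBoostingClassifier()
      model.fit(X_train, y_train)

      return {
          "predictions": model.predict(X_test),
          "probabilities": model.predict_proba(X_test),
      }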

4.2 Extensible Benchmarks

Benchmark suites define the data sets and one or more train/test splits that should be used to evaluate the AutoML frameworks. The benchmark tool can work directly with OpenML tasks and suites (Bischl et al., 2021), allowing new evaluations without further changes to the tool or its configuration. This is the preferred way to use the benchmark tool for scientific experiments, as it guarantees that the exact evaluation procedure can easily be reproduced by others. However, it is also possible to use data sets stored in local files with manually defined splits, e.g., to benchmark private use cases. (For information on how to add a new benchmark task or suite, see https://github.com/openml/automlbenchmark/blob/master/docs/HOWTO.md#add-a-benchmark.)

4.3 Using the Software

To benchmark an AutoML framework, the user must first identify and define:

  • the framework against which the benchmark is executed,

  • the benchmark suite listing the tasks to use in the evaluation, and

  • the constraints that must be imposed on each task. These include:

    • the maximum training time.

    • the number of CPU cores that can be used by the framework; not all frameworks respect this constraint, but when run in aws mode, this constraint translates to specific EC2 instances, therefore limiting the total number of CPUs available to the framework.

    • the amount of memory that can be used by the framework; not all frameworks respect this constraint, but when run in aws mode, this constraint translates to specific EC2 instances, therefore limiting the total amount of memory available to the framework.

    • the amount of disk volume that can be used by the framework (only respected in aws mode).

    These constraints must then be declared explicitly in a constraints.yaml file (either in the resources folder or as an external extension).

4.3.1 Commands

Once the previous parameters have been defined, the user can run a benchmark on the command line using the basic syntax:

  $ python runbenchmark.py framework_id benchmark_id constraint_id

For example, to evaluate the tuned random forest baseline on the classification suite:

  $ python runbenchmark.py tunedrandomforest  openml/s/271 1h8c

Additional options may be used to specify the mode, the parallelization, and other details of the experimental setup. For example, the following command may be used to evaluate the random forest baseline on the regression benchmark suite across ten 8-core aws instances in parallel, with a time budget of one hour.

  $ python runbenchmark.py randomforest  openml/s/269 1h8c -m aws -p 10

5 Benchmark Design

In this section, we discuss both the design of the benchmark suite (that is, the chosen data sets and evaluation procedures, Bischl et al., 2021) and the experimental setup, as well as their limitations.

5.1 Benchmark Suites

To facilitate a reproducible experimental evaluation, we make use of OpenML benchmark suites (Bischl et al., 2021). An OpenML benchmark suite is a collection of OpenML tasks, each of which references a data set, an evaluation procedure (such as k-fold cross-validation) and its splits, the target feature, and the type of task (regression or classification). The benchmark suites are designed to reflect the wide range of realistic use cases for which AutoML tools are built. Resource constraints are not part of the task definition. Instead, we define them separately in a local file so that each task can be evaluated under multiple resource constraints. Both the OpenML benchmark suites (and tasks) and the resource constraints are machine-readable to ensure automated and reproducible experiments.
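
For reference, OpenML benchmark suites and their tasks can be retrieved programmatically. The sketch below assumes the current openml-python API and uses the classification suite ID referenced in Section 5.1.1.

  import openml

  # Fetch a benchmark suite and iterate over its tasks; 271 is the
  # classification suite referenced in Section 5.1.1.
  suite = openml.study.get_suite(271)
  for task_id in suite.tasks:
      task = openml.tasks.get_task(task_id)
      dataset = task.get_dataset()
      X, y, categorical, names = dataset.get_data(target=task.target_name)
      print(task_id, dataset.name, X.shape)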

5.1.1 Data Sets

We created two benchmarking suites, one with 71 classification tasks, and one with 33 regression tasks. The data sets used in these tasks are selected from previous AutoML papers (Thornton et al., 2013), competitions (Guyon et al., 2019), and ML benchmarks (Bischl et al., 2021) according to the following predefined list of criteria:

  • Difficulty of the data set must be sufficiently high. If a problem is easily solved by almost any algorithm, it will not be able to differentiate the various AutoML frameworks. This can mean that simple models (such as random forests, decision trees, or logistic regression) achieve a generalization error of zero, or that the performance of these models and all evaluated AutoML tools is identical.

  • Representative of real-world data science problems to be solved with the tool. In particular, we limit artificial problems. We included a small selection of such problems, either based on their widespread use (kr-vs-kp) or because they pose difficult problems, but we do not want them to constitute a large part of the benchmark. We also limit computer vision problems on raw pixel data, because those problems are typically solved with dedicated deep learning solutions. However, since they still make for real-world, interesting, and hard problems, we did not exclude them altogether.

  • No free form text features that cannot reasonably be interpreted as a categorical feature. Most AutoML frameworks do not yet support feature engineering on text features and will process them as categorical features. For this reason, we exclude text features, even though we admit their prevalence in many interesting real-world problems. A first investigation and benchmark of multimodal AutoML with text features has been carried out by Shi et al. (2021).

  • Diversity in the problem domains. We do not want the benchmark to skew towards any particular application domain. There are various software quality problems in the OpenML-CC18 benchmark (jm1, kc1, kc2, pc1, pc3, pc4), but adopting them all would bias the benchmark towards this domain.

  • Independent and identically distributed (i.i.d.) data are required for each task. If the data are of temporal nature or repeated measurements have been conducted, the task is discarded. Both types of data are generally very interesting but are currently not supported for most AutoML systems, and we plan to extend the benchmark in the future in this direction.

  • Freely available and hosted on OpenML. Data sets that can only be used on specific platforms, like Kaggle, or are not shared freely for any reasons are not included in the benchmark.

  • Miscellaneous reasons to exclude a data set included label leakage and near-duplicates of other tasks in the independent variables (for example, differing only in categorical encoding or imputation) or the dependent variable (most commonly the binarization of a regression or multi-class task).

To study the differences between AutoML systems, the data sets vary in the number of samples and features by orders of magnitude, and they vary in the occurrence of numeric features, categorical features, and missing values. Figure 1 shows basic properties of the classification and regression tasks, including the distributions of the number of instances and features, the frequency of missing values and categorical features, and the number of target classes (for classification tasks). Properties of the tasks are shown in Appendix A and can be explored interactively on OpenML (visit www.openml.org/s/271 for classification and www.openml.org/s/269 for regression). While the selection spans a wide range of data types and problem domains, we recognize that there is room for improvement. Restricting ourselves to open data sets without text features severely limits our options, especially for large data sets.

Figure 1: Properties of the tasks in both benchmarking suites.

All data sets are available in multiple formats for the AutoML frameworks, either as files (parquet, arff, or csv) or as Python objects (pandas dataframe, numpy array); the format used depends on the AutoML framework. All frameworks have access to meta-data, such as the datatype of the columns, either directly from the chosen data format or as separate input, so that each AutoML framework has the same information available regardless of data format.

5.1.2 Performance Metrics

In our evaluation, we use the area under the receiver operating characteristic curve (AUC) for binary classification, log loss for multi-class classification, and root mean-squared error (RMSE) for regression to evaluate model performance (we use the implementations provided by SCIKIT-LEARN 0.24.2). We chose these metrics because they are generally reasonable, commonly used in practice, and supported by most AutoML tools. The latter is especially important, because it is imperative that AutoML systems optimize for the same metric on which they are evaluated. However, our tool is not limited to these three metrics, and a wide range of performance metrics can be specified by the user.
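
For concreteness, the corresponding SCIKIT-LEARN calls are shown below as a minimal sketch with toy values.

  from sklearn.metrics import roc_auc_score, log_loss, mean_squared_error

  # Binary classification: area under the ROC curve (higher is better).
  auc = roc_auc_score([0, 1, 1, 0], [0.1, 0.8, 0.6, 0.3])

  # Multi-class classification: log loss on predicted probabilities (lower is better).
  ll = log_loss([0, 2, 1], [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.6, 0.2]])

  # Regression: root mean-squared error (lower is better).
  rmse = mean_squared_error([3.0, 5.0, 2.5], [2.8, 5.4, 2.1], squared=False)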

5.1.3 Missing Values In Experimental Results

As will be discussed in more detail in Section 6.4, not all frameworks are equally well-behaved. There are situations where search time budgets are exceeded or the AutoML frameworks crash outright, which results in missing performance estimates. There are multiple strategies for dealing with this missing data.

One naive approach would be to ignore missing values and aggregate over the obtained results. However, we observe that failures do not occur at random. Failures correlate with data set properties, such as data set size and class imbalance, which may be correlated with “problem difficulty” and thus performance. Ignoring missing values thus means that AutoML frameworks may fail on harder tasks or folds and consequently obtain higher average performance estimates. Imputing missing values with the performance obtained by the same AutoML framework on other folds suffers from the same drawback. Moreover, when a framework fails to produce predictions on every fold of a task, neither approach specifies how to deal with the missing values.

Instead, we propose to impute the missing values with an interpretable and reliable baseline. An argument may be made for using the random forest baseline, since this may be a strong fallback that AutoML frameworks could realistically implement. However, we observe that training a random forest (of the size used in the baseline) requires a non-trivial amount of time on some data sets. Automatically providing this fallback by means of imputation would give an unfair advantage to the AutoML frameworks that are not well-behaved. Moreover, many failures would not be remedied by having a random forest to fall back on, since the AutoML frameworks crash irrecoverably, e.g., due to segmentation faults.

Instead, we impute missing values with the constant predictor, or prior. This baseline returns the empirical class distribution for classification and the empirical mean for regression. This is a very penalizing imputation strategy, as the constant predictor is often much worse than results obtained by the AutoML frameworks that produce predictions for the task or fold. However, we feel this penalization for ill-behaved systems is appropriate and fairer towards the well-behaved frameworks and hope that it encourages a standard of robust, well-behaved AutoML frameworks.
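
In terms of the result tables used in our analysis, this imputation is a simple fill operation; a sketch with pandas follows, where the column names are illustrative.

  import pandas as pd

  def impute_with_constant_predictor(results: pd.DataFrame) -> pd.DataFrame:
      """results: one row per (task, fold), one column per framework, NaN where a
      framework failed; a 'constant_predictor' column holds the baseline score."""
      imputed = results.copy()
      frameworks = [c for c in results.columns if c != "constant_predictor"]
      for framework in frameworks:
          imputed[framework] = imputed[framework].fillna(results["constant_predictor"])
      return imputed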

5.2 Experimental Setup

We execute the experiments on commodity-level hardware with AutoML frameworks generally in their default configurations.

5.2.1 Hardware

For comparable hardware and easy extensibility, we opted to conduct the benchmark on standard m5.2xlarge instances available on Amazon Web Services (AWS) (more information is available at https://aws.amazon.com/ec2/instance-types/m5/). These represent current commodity-level hardware, with 32 GB of memory and 8 vCPUs (Intel Xeon Platinum Skylake-SP processors with a sustained all-core Turbo CPU clock speed of up to 3.1 GHz). gp3-SSD storage is attached, which can be necessary for storing a larger number of evaluated pipelines. The use of AWS also enables others to fully reproduce and extend our results, as funding permits, since the results do not depend on private computing infrastructure. As discussed in Section 4.1, the benchmark is not limited to AWS but can be run on any machine.

5.2.2 Framework Configuration

All AutoML frameworks are instantiated with their default configuration, with the following exceptions:

  • Runtime allowed for the search. Additionally, there is one hour of leeway for data loading, making predictions, and cleanup operations, but this is not communicated to the AutoML frameworks.

  • Resource constraints that specify the number of CPU cores and amount of memory available.

  • Target metric to use for optimization. This is the same metric that is used for evaluation in the benchmark.

  • ‘mode’ to declare the user intent: for example, obtaining the best possible model versus finding an interpretable (less complex) model. The mode used to evaluate each AutoML framework is chosen by its developers.

  • ‘output directory’, where any artefacts of the AutoML framework may be stored.

The experiment design intentionally prohibits further customization of other AutoML system configuration parameters to reflect how these systems are usually applied in practice as closely as possible (note that the benchmark tool does allow for this type of customization). An overview of the exact framework versions used and their ‘mode’ configurations is shown in Table 10 in the appendix.

5.3 Limitations

Both the design of the benchmark and the setup for the experiments described in this paper have some limitations with regard to the interpretation of their results. Limitations in the design stem from the desire to keep the use of the tools as close as possible to the original vision and usage intended by developers, whereas the limitations in the experiments are caused by resource constraints and may be alleviated by running additional experiments with the benchmark software. In this section, we highlight some important limitations and stress that this paper and the results within do not state which AutoML tool is ultimately the best.

5.3.1 Limitations of the Design

Perhaps the biggest limitation of the design is the inability to attribute the performance of an AutoML tool to any one aspect of its build, as is often done with ablation studies. The evaluated AutoML tools differ among multiple design choices, such as underlying ML library, search space, preprocessing, and search algorithm. Concretely, a performance difference between, e.g., AUTO-SKLEARN and TPOT could be caused by TPOT’s built-in stacking, AUTO-SKLEARN’s ensembles, the difference in Bayesian optimization versus genetic programming, the difference in how multiprocessing is employed, or a combination of these or any other difference between them. Software that would allow for such conclusions essentially requires each AutoML tool to be reimplemented on a shared set of algorithms for building models, search, and evaluation. We acknowledge that this would be incredibly valuable for the research community. However, it would also no longer resemble the software as used in practice and thus would be different work altogether. Note that it is possible to perform ablation studies with the benchmark tool for a specific AutoML framework, e.g., by comparing different framework configurations as done by Erickson et al. (2020).

Another limitation stems from only recording results produced by the final model. Anytime performance, where the performance of intermediate models during optimization is recorded as if they were final models, can be very insightful. It allows distinguishing a tool that converges quickly from one that does not. This may be especially important for users who are interested in using the systems with a human in the loop, such as when designing a search space or data features. Unfortunately, many tools do not support the collection of anytime performance, and—depending on how it is recorded—it might interfere with the resources used during search. We hope to be able to record anytime performance in the future, but in this work, we only approximate it by evaluating the tools under two different time constraints (one and four hours).

Finally, the qualitative comparison of the tools is also limited. Certain quality-of-life features, such as analysis of the pipeline via interpretable ML methods, reports, usability, and support, are not evaluated here but are important to many users. For a qualitative analysis of those characteristics, we refer the reader to one of the many existing overview papers on AutoML (Zöller and Huber, 2021; Truong et al., 2019). We also provide an overview of various AutoML frameworks, with links to their documentation, on our website (https://openml.github.io/automlbenchmark/frameworks.html).

5.3.2 Limitations of the Experiments

Most tools are highly configurable and allow the user to configure the search algorithm or its hyperparameters, among other aspects that affect AutoML performance. Some tools even provide different configuration presets for different use cases, such as a performance-oriented competition mode—which we use—and a mode that produces fast or interpretable models at the cost of some performance. However, comparing the effect on model performance of tuning AutoML hyperparameters or using different presets quickly becomes cost-prohibitive. For this reason, we limit our experiments to one mode per framework, specified by the frameworks’ developers, selecting the most performance-oriented setting for each tool. It is likely that better results may be achieved by carefully meta-tuning an AutoML tool, or that the tool with the best performance in competition mode has relatively poor performance in interpretable mode. While it is cost-prohibitive for us to evaluate many different scenarios, it is easy to run the benchmark with custom configurations for the various AutoML tools. This allows users to evaluate AutoML systems in a setting that reflects their interests.

5.3.3 Meta-learning

Many AutoML tools make use of meta-learning to better initialize and speed up the search (Yang et al., 2018; Feurer et al., 2015a, 2020). Since all data sets in the benchmark are publicly available and many of them are well known in the AutoML community, there is likely a substantial overlap between the data used by developers for meta-learning and the data used in the benchmark. This is a very intricate problem: because we consider AutoML tools as black boxes, removing the effect of the data set to be evaluated from the meta-learning procedure is not solvable in general.

In this paper specifically, both AUTO-SKLEARN and AUTO-SKLEARN 2 use meta-learning. AUTO-SKLEARN’s meta-learning uses data sets from OpenML, each associated with well-performing ML pipelines. The search is initialized with a k-nearest-data-sets (KND) lookup using meta-features (Reif et al., 2012). AUTO-SKLEARN can exclude data sets by name from this lookup, which we make use of in the benchmark, and we ensured that there is no overlap with the data sets from Gijsbers et al. (2019). Even so, it cannot be guaranteed that identical data sets under a different name are not used for meta-learning out of the box.

AUTO-SKLEARN 2’s meta-learning model is more complicated and consists of: a) a static pipeline portfolio for warm-starting the search, which is computed across hundreds of data sets using a greedy forward selection, and b) a meta-model to predict the internal model selection strategy and budget allocation strategy. Single data sets cannot be excluded from these meta-learning procedures, and it is not feasible to retrain the meta-models and pipeline portfolio for each data set in our benchmark. This ultimately means that the result of AUTO-SKLEARN and especially AUTO-SKLEARN 2 must be considered very carefully. More research is required to address these issues and allow for the correct evaluation of AutoML systems that use meta-learning in common benchmarks.

5.4 Overfitting the Benchmark

One last issue that plagues any widely-adopted benchmark is the potential of algorithms overfitting on the data sets used in the benchmark. Since freely available, interesting, and usable (refer to Section 5.1.1 for our selection criteria) data sets are scarce, many AutoML developers use these data sets to benchmark and then improve their systems iteratively. While this is not as direct of an issue as with meta-learning, these data sets can in general not be assumed to be truly unseen. The only practical way to avoid this is to collect a novel set of data sets for the benchmark, which would entail a prohibitive effort. Moreover, after publishing such a benchmark, the new data sets are published, which again gives developers the possibility to use them to improve their systems. On the other hand, should the benchmarking data sets be kept private to avoid this issue, the benchmark is no longer entirely reproducible by independent researchers. We hope that the size of our benchmarking suites is large enough and their design general enough that overfitting is less of an issue, but this is difficult to guarantee, and a study as outlined above may be useful to evaluate this phenomenon in the future.

6 Results

In this section, we provide an overview and analysis of the obtained results. This section is accompanied by an interactive visualization tool (results can be interactively explored at https://compstat-lmu.shinyapps.io/AutoML-Benchmark-Analysis/), additional information in Appendix B, and all data artifacts generated from these experiments (available at https://test.openml.org/amlb/). For a more comprehensive comparison than what we can provide here, we strongly encourage the reader to explore the data with the interactive visualization tool. It enables users to select any subset of frameworks, task types, performance measures, or data characteristics iteratively and interactively. The tool is created with RSHINY (Chang et al., 2017) and includes overview plots for different task types as well as detailed visualizations for individual data sets. Moreover, statistical analyses of the results via critical difference plots and Bradley-Terry trees are implemented.

In addition to the limitations outlined in Section 5.3, we also want to stress that these experimental results were obtained by running the frameworks as they were in September 2021. Some of these frameworks are still under active development, and results from experiments run on later versions will almost certainly differ. We strongly encourage people to run additional experiments that match their use case with up-to-date frameworks, using the analyses in this section as a reference. Table 10 in Appendix C.2 shows an overview of the versions benchmarked for each framework as well as the most current version.

6.1 Performance

To report on the results for many AutoML frameworks across whole benchmarking suites, we propose using critical difference (CD) diagrams (Demšar, 2006). In a CD diagram, the average rank of each framework is shown, as well as which ranks are statistically significantly different from each other. To calculate the average rank per task, we first impute any missing values with the constant predictor and then average the performance over all folds. We then test for the presence of statistically significant differences in the average rank distributions using a non-parametric Friedman test (which indicates significant differences for every diagram shown here) and use a Nemenyi post-hoc test to find which pairs differ. For each benchmarking suite and time budget, the CD diagrams are shown in Figure 2, which displays the rank of each framework (lower is better) averaged over all results from the given benchmarking suite and budget. AUTO-SKLEARN 2 is excluded from this comparison due to the meta-learning issues discussed in Section 5.3.3 and because its inclusion would affect the critical difference as well as the rank of the other frameworks.
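As a rough illustration of this procedure, the sketch below computes average ranks, the Friedman test, and the Nemenyi critical difference from a hypothetical score matrix. It assumes scipy >= 1.7 (for studentized_range) and is not the exact code used to produce Figure 2.

```python
import numpy as np
from scipy import stats

def average_ranks(scores):
    """scores: array of shape (n_tasks, n_frameworks); higher is better."""
    ranks = stats.rankdata(-scores, axis=1)  # rank 1 = best framework on that task
    return ranks.mean(axis=0)

def friedman_and_nemenyi_cd(scores, alpha=0.05):
    n_tasks, k = scores.shape
    _, p_value = stats.friedmanchisquare(*scores.T)
    # Nemenyi critical difference: CD = q_alpha * sqrt(k (k + 1) / (6 * n_tasks)),
    # with q_alpha approximated by a large-df studentized range quantile.
    q_alpha = stats.studentized_range.ppf(1 - alpha, k, 10_000) / np.sqrt(2)
    cd = q_alpha * np.sqrt(k * (k + 1) / (6 * n_tasks))
    return p_value, cd
```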

(a) Binary Classification, 1 hour
(b) Binary Classification, 4 hours
(c) Multi-class Classification, 1 hour
(d) Multi-class Classification, 4 hours
(e) Regression, 1 hour
(f) Regression, 4 hours
Figure 2: CD plots with Nemenyi post-hoc test after imputing missing values with the constant predictor baseline.

Overall, we observe that AUTOGLUON and TPOT achieve the best and worst rank, respectively, among the AutoML frameworks in each setting with respect to model accuracy, although never by a statistically significant margin. In almost all cases, the baselines rank worse than the AutoML frameworks, although the tuned random forest is a strong baseline that is often not significantly worse than many of the AutoML frameworks. All AutoML frameworks except AUTOGLUON and TPOT are generally ranked close to each other, with small differences across the various suites and budgets.

Figure 3: Boxplots of framework performance across tasks after scaling the performance values from random forest (-1) to best observed (0). The number of outliers for each framework that are not shown in the plot is denoted on the x-axis.

To complement the CD diagrams, which obfuscate the relative performance differences, we show box plots of the obtained results (after imputation) across all tasks in Figure 3. Because performances are not commensurable across tasks, we first scale all results per task between the random forest performance (-1) and the best observed performance (0), so that higher scores correspond to better performance. This also makes the scaled value interpretable: it expresses performance relative to the improvement over the baseline that is observed to be achievable. While the boxplots are calculated over performance data on all tasks, the plots are cut off to allow a better visualization of the most relevant area. The number of outliers for each framework that are not shown in the plot is denoted on the x-axis.
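A minimal sketch of this scaling, assuming per-task scores where higher is better; the variable names are hypothetical.

```python
def scale_result(framework_score, rf_score, best_score):
    """Random forest maps to -1, the best observed result to 0; results worse than the
    baseline fall below -1 (these appear as the cut-off outliers in Figure 3)."""
    if best_score == rf_score:  # degenerate task: no observed improvement over the baseline
        return 0.0
    return (framework_score - best_score) / (best_score - rf_score)
```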

Even if ranks are similar, the performance distributions might be noticeably different. For example, GAMA, H2O AUTOML, and LIGHTAUTOML achieve very similar average ranks on the one hour binary classification tasks. However, we observe from the boxplots that while H2O AUTOML achieves a lower median normalized performance in this segment, its worst observed performances are much better than those of GAMA and LIGHTAUTOML. Similarly, while TPOT’s average rank is generally close to that of the Tuned Random Forest baseline, TPOT exhibits much higher variance in its prediction quality.

Figure 4: Bradley-Terry tree of depth three for classification tasks. Results from the one hour classification benchmark were used, and missing values were imputed by constant predictor performance. One observation within the BT tree equals the preference ranking of one fold on one data set.

6.2 Bradley-Terry Trees

Bradley-Terry (BT) trees (Strobl et al., 2011) can be used to statistically analyze benchmark experiments based on data set characteristics (Eugster et al., 2014). These trees use data set characteristics (such as the number of instances, the number of features, or the ratio of missing values) to split paired performance comparisons of the frameworks and find statistically significant differences in performance. Bradley-Terry models originate in psychology, where they are used to analyze paired comparison experiments in which subjects prefer one stimulus over another. For our benchmark, such a preference ranking can easily be derived from pairwise performance comparisons of all frameworks on each data set and cross-validation fold.
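For illustration, the following sketch derives such paired comparisons from fold-level results. The DataFrame layout (columns task, fold, framework, score, with higher scores being better) is a hypothetical assumption, not the benchmark’s actual file format.

```python
from itertools import combinations
import pandas as pd

def paired_comparisons(results: pd.DataFrame) -> pd.DataFrame:
    """Turn fold-level scores into one 'winner vs. loser' record per framework pair."""
    rows = []
    for (task, fold), group in results.groupby(["task", "fold"]):
        scores = dict(zip(group["framework"], group["score"]))
        for a, b in combinations(sorted(scores), 2):
            if scores[a] == scores[b]:
                continue  # ties could also be recorded explicitly
            winner, loser = (a, b) if scores[a] > scores[b] else (b, a)
            rows.append({"task": task, "fold": fold, "winner": winner, "loser": loser})
    return pd.DataFrame(rows)
```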

The underlying algorithm of model-based recursive partitioning of Bradley-Terry models works as follows. In each split of the BT tree, a BT model is fitted to the paired comparisons based on the underlying data set characteristics. Following Zeileis and Hornik (2007) and Eugster et al. (2014), the BT model performs a statistical test of parameter instability for the chosen data set characteristics. If this test reveals a significant instability in the model parameters, the corresponding tree node splits the data according to the characteristic yielding the highest instability (lowest test p-value). The splitting cut-point is then chosen such that it yields the highest improvement in model fit. This procedure is repeated until either no significant instability is left, a set tree depth is reached, or further splits would exceed a set minimum number of observations in the leaves.

Numeric values in the tree leaves are worth parameters that can be interpreted as preferences for the different frameworks (Eugster et al., 2014). Since these values lie in [0, 1] and sum up to 1 within a leaf, they can be understood as the probability of a framework performing best, given the data characteristics in the corresponding leaf.
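As a small illustration of how such worth parameters can be read, with purely hypothetical numbers:

```python
# Hypothetical worth parameters for three frameworks in one leaf: non-negative, summing to 1.
worth = {"A": 0.5, "B": 0.3, "C": 0.2}

def prob_best(framework, worth=worth):
    """Probability that `framework` performs best within this leaf."""
    return worth[framework]

def prob_preferred(a, b, worth=worth):
    """Bradley-Terry probability that framework `a` is preferred over framework `b`."""
    return worth[a] / (worth[a] + worth[b])

print(prob_best("A"), prob_preferred("A", "B"))  # 0.5 0.625
```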

Figure 4 shows a Bradley-Terry tree for classification tasks with a runtime of one hour. For simplicity, in order to obtain an easily understood tree, only the number of instances, the number of features, and the imbalance ratio were chosen as data set characteristics. The first split distinguishes between data sets with more instances than the chosen cut-point and those at or below it. Following the left child node, a cut-point on the imbalance ratio defines the two left tree leaves. The left one (Node 3), containing small and very balanced classification data sets, indicates that in such situations GAMA is preferred over all other frameworks. Even though AUTOGLUON is less preferred than GAMA for those kinds of data sets, AUTOGLUON is still preferable to all remaining frameworks. On small, more imbalanced data sets (Node 4), AUTOGLUON is preferable to all other frameworks, followed by AUTO-SKLEARN 2.

The right half of the tree is again divided into medium and large data sets at a second cut-point on the number of observations. While AUTOGLUON is clearly the preferred framework in the left leaf (Node 6), FLAML is preferred on large classification data sets (Node 7), followed by MLJAR and AUTOGLUON.

Simpler Bradley-Terry trees with only the number of instances and features as data set characteristics can be found in the appendix. Note that the findings from the BT trees are essentially the same as those from Section 6.1, as AUTOGLUON is the preferred framework in most tree leaves. Moreover, the reader is strongly encouraged to explore the aforementioned interactive visualization tool, with which deeper BT trees based on several more data set characteristics can be constructed for various task types.

6.3 Model Accuracy vs. Inference Time Trade-offs

Model accuracy plays a central role in evaluating the performance of machine learning models. However, maximizing accuracy can come at the cost of added model complexity. One practical way to account for this complexity is to measure the inference speed of the resulting model.

Some of the integrated frameworks (AUTOGLUON, MLJAR, LIGHTAUTOML, and GAMA) offer a “compete” mode that maximizes accuracy, typically at the cost of increased model complexity, similar to competing in a Kaggle competition. This can lead to models that are highly accurate but extremely slow at inference time, and therefore not practical in many real-life use cases.

Fortunately, some frameworks provide multiple presets that allow the user to balance the trade-off between accuracy and inference time differently. The results in this section used presets which prioritize accuracy over inference time, and performing additional experiments with other presets is advised when inference time is important (we plan to do additional experimental evaluations, which may include more balanced presets, later this year and publish the results on our website). On the other hand, we also recognize that there are applications for which inference time is not important.

In order to evaluate the limitations of the models produced by each framework (in “compete” mode), we also measured the “prediction duration”: how long the model needed to produce predictions for the test set of each data set in the benchmark. This metric provides important insight into the trade-offs that tool authors make in their algorithm designs.

Figure 5 shows aggregated inference times across all models, including the total time to score the test set (predict duration) as well as the per-row prediction speed (predict duration divided by the number of rows in each test set). Outliers have been removed from both plots for visibility, as there are a handful of very extreme outliers. Both AUTOGLUON and MLJAR stand out as orders of magnitude slower than the other AutoML tools, on average (note that AUTOGLUON’s most recent version, released after the experiments were performed, adds the option to explicitly constrain inference time and provides more presets that balance inference time and model accuracy). LIGHTAUTOML is also slower than the remaining tools. GAMA is approximately as fast as AUTO-SKLEARN and AUTO-SKLEARN 2, which can be explained by the fact that they all build optimized SCIKIT-LEARN pipelines and create an ensemble using the same algorithm (Caruana et al., 2004, 2006). H2O AUTOML, FLAML, and TPOT stand out as having very fast inference times, although TPOT is much less accurate than the other tools, as seen in more detail below.

Figure 5: Prediction duration (on the test set) in seconds and per-row prediction speed in seconds (prediction duration divided by the total number of rows in the test set), aggregated across all runs (tasks, problem types, and constraints).
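A short sketch of this aggregation, assuming a hypothetical pandas DataFrame of per-run measurements (columns framework, predict_duration, test_rows); this is not the benchmark’s own analysis code.

```python
import pandas as pd

def inference_time_summary(runs: pd.DataFrame) -> pd.DataFrame:
    """Aggregate total predict duration and per-row prediction speed per framework."""
    runs = runs.assign(per_row_speed=runs["predict_duration"] / runs["test_rows"])
    return runs.groupby("framework")[["predict_duration", "per_row_speed"]].agg(["mean", "median"])
```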

In Figure 6, we show the Pareto front for all six scenarios, plotting the average normalized model accuracy against the corresponding average per-row prediction speeds (Appendix B.2 contains the same plots with median per-row prediction speeds instead). Here, it is more apparent that the frameworks that achieve the highest accuracy do so at the cost of inference time. This demonstrates that when contextualizing any type of model accuracy results, it is important to consider the trade-offs that may have been made to achieve the extra performance and how they affect the framework’s usability in practice. In this case, measuring accuracy in isolation does not give the complete picture of the overall utility of a particular framework.
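For reference, a small sketch of how such a Pareto front can be extracted from aggregated values; the input mapping is hypothetical, with higher normalized accuracy being better and lower per-row prediction time being better.

```python
def pareto_front(points):
    """points: dict mapping framework -> (normalized_accuracy, per_row_prediction_time)."""
    front = []
    for name, (acc, time) in points.items():
        dominated = any(
            other != name
            and o_acc >= acc and o_time <= time
            and (o_acc, o_time) != (acc, time)   # strictly better in at least one dimension
            for other, (o_acc, o_time) in points.items()
        )
        if not dominated:
            front.append(name)
    return sorted(front)
```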

It should also be noted that with sufficient computing infrastructure and effort, scoring across rows or chunks of data could be parallelized in a production system, which would reduce the overall prediction time compared to predicting a single test set on a single machine. However, our goal in this section was to compare the inference time of the high-accuracy models derived from different AutoML frameworks to each other, rather than to evaluate different techniques for speeding up the inference of any individual system.

Figure 6: Pareto Frontiers of framework performance across tasks after scaling the performance values from the worst framework (0) to best observed (1).

6.4 Observed AutoML Failures

While most jobs completed successfully, we observed multiple framework errors during our experiments. In this section, we discuss where AutoML frameworks fail, although we want to stress that development of these packages is ongoing. For that reason, it is likely that the same frameworks will not experience the same failures in the future (especially after gaining access to all experiment logs). We categorize the errors as follows:

  • Memory: The framework crashed due to exceeding available memory or encountering other memory-related errors, such as segmentation faults.

  • Time: The framework exceeded the time limit past the leniency period.

  • Data: The framework failed due to specific data characteristics (such as imbalanced data).

  • Implementation: The framework failed due to bugs in its own code.

These categories are somewhat crude and ultimately subjective, since from a reductive viewpoint all errors are implementation errors. However, they serve as a quick overview; a more detailed discussion can be found in Appendix D.
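As an illustration of the kind of triage this involves, here is a hedged sketch with hypothetical keyword patterns; the actual assignment for the benchmark was done with more care and manual inspection of the logs (see Appendix D).

```python
import re

# Hypothetical keyword rules; real error messages vary widely per framework.
RULES = [
    ("memory", re.compile(r"memory|segmentation fault|oom", re.IGNORECASE)),
    ("time", re.compile(r"time ?limit|timed? ?out", re.IGNORECASE)),
    ("data", re.compile(r"imbalanc|missing class|n_splits", re.IGNORECASE)),
]

def categorize(error_message: str) -> str:
    """Assign an error message to one of the four categories described above."""
    for label, pattern in RULES:
        if pattern.search(error_message):
            return label
    return "implementation"  # fallback: treat unmatched errors as framework bugs
```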

Figure 7 shows the errors by type on the left and by task on the right. Overall, memory and time constraints are the main causes of errors, with one major exception: MLJAR has 190 ‘implementation errors’, which are caused by only two distinct index errors. We observe that errors are far more common in the classification benchmarking suite than in the regression suite. This is largely accounted for by the difference in suite size (71 classification versus 33 regression tasks) and the fact that the largest data sets, both in number of instances and features, are mostly classification. Unique to classification, we also observe several frameworks failing to produce models or predictions on highly imbalanced data sets. This is also the case for the failures on the two small classification data sets (‘yeast’ and ‘wine-quality-white’), where carelessly created internal validation splits may no longer contain all classes. Interestingly, AutoML frameworks fail more frequently on a larger time budget: both memory and time constraint violations happen more often, which may be explained by frameworks saving increasingly more models or building increasingly larger pipelines.

Figure 7: For each framework, errors by type are shown on the left, and errors by task are shown on the right.

We record a time error only when the framework exceeds the time budget by more than one hour. However, as we can see in Figure 8, not all AutoML frameworks adhere to the runtime constraints equally well, even if they finish within the leniency period. In the figure, the training duration of each individual job (task and fold combination) is aggregated, timeout errors are shown for each framework, and missing values due to non-time errors are excluded. These plots reveal different design decisions around the specified runtime, with some frameworks never exceeding the limit by more than a few minutes, while others violate it by a larger margin with some regularity. Interestingly, we also see that a number of frameworks consistently stop far before the specified runtime limit.
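A minimal sketch of this bookkeeping, assuming durations in seconds and the one-hour leniency period described above:

```python
def classify_runtime(training_duration, budget, leniency=3600):
    """Classify a job's training duration (seconds) against the budget plus leniency."""
    if training_duration <= budget:
        return "within budget"
    if training_duration <= budget + leniency:
        return "over budget, within leniency"
    return "timeout error"

print(classify_runtime(3 * 3600 + 1200, 4 * 3600))  # 'within budget'
print(classify_runtime(4 * 3600 + 1200, 4 * 3600))  # 'over budget, within leniency'
print(classify_runtime(5 * 3600 + 600, 4 * 3600))   # 'timeout error'
```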

Figure 8: Time spent during search with a one hour budget (left) and four hour budget (right). The grey line indicates the specified time limit, and the red line denotes the end of the leniency period. The number of timeout errors for each framework are shown beside it.

7 Conclusion

We presented a novel benchmark for measuring and comparing AutoML frameworks. To ensure reproducibility, fair comparison, and detailed analysis of the results, our open-source benchmarking tool automates the empirical evaluation of any integrated framework on any supported task, including installation of the AutoML framework, provisioning the data for training and inference, resource allocation, and processing of the results. This greatly simplifies evaluating AutoML frameworks while enhancing reproducibility and reducing errors. We worked jointly with the authors of 9 AutoML frameworks to evaluate their systems in a large-scale study on 71 classification and 33 regression tasks.

When analyzing the predictive performance of these AutoML systems, we find that the average ranks of the AutoML frameworks are generally not statistically significantly different from each other. Still, by using Bradley-Terry trees (Strobl et al., 2011), we find that their relative performance is affected by data characteristics—such as the data set size, dimensionality, and class balance—highlighting specific strengths and weaknesses. Overall, in terms of model performance, AUTOGLUON consistently has the highest average rank in our benchmark. Additionally, in most scenarios, the AutoML frameworks outperform even our strongest baseline.

Because inference time is an important factor in real-world applications, we also reviewed the trade-off between inference time and accuracy and found large differences in the inference time of the produced models, at times spanning multiple orders of magnitude. The most accurate frameworks achieve their higher model accuracy at a large cost in inference speed. Broadly speaking, while the models with higher accuracy also have slower inference times, not all frameworks produce models that are Pareto optimal.

Finally, we analyzed scenarios in which AutoML frameworks fail to produce a model and found that the main cause for failure was data set size. In other words, not all methods scale well. To allow further analysis of our results, we provide an open-source interactive visualization tool, which includes graphical representations and statistical tests.

7.1 Limitations

These quantitative results are obtained by using AutoML frameworks in their default configuration, optionally with a ‘benchmarking preset’, to reflect the out-of-the-box nature of AutoML frameworks. The performance of frameworks under non-default settings can be very relevant when, for example, there is budget available to also optimize AutoML hyperparameters, or when a custom search space is used.

Next, all frameworks differ along multiple design axes, which prohibits attributing performance differences to any specific component of the AutoML framework (such as the search algorithm), without additional analysis.

Moreover, since we provide a purely quantitative comparison, it ignores qualitative aspects of AutoML frameworks that are very relevant in real-world settings, such as the interpretability of the produced models or the level of support. Additionally, most of the evaluated AutoML frameworks are under active development, and their performance is subject to change.

Lastly, the benchmarks were executed on 8-CPU machines, which is very modest by today’s standards. Therefore, frameworks with better parallelism may show even greater advantages on higher powered machines (with, for example, 50 or 100 CPUs).

7.2 Future Work

The benchmarking suites, or sets of tasks, proposed in this paper are meant as a starting point to be improved upon in collaboration with the AutoML community. We will seek to update these suites based on community discussion so that they continue to reflect modern challenges while also decreasing the risk of AutoML frameworks overfitting to the benchmark. In particular, we are interested in extending the benchmarking suites with problems that feature free-form text data. Real-world data often contain instances of text data with different semantic meaning, such as addresses or URLs, from which meaningful features can be extracted. This kind of feature engineering is typically very important for model performance on such data sets, but the current selection of data sets does not reflect this (in part because not all AutoML frameworks support text features).

We would also like to extend the benchmarking tool to support new problem types, such as multi-objective optimization tasks. While model accuracy is often an important metric, ‘secondary’ metrics, such as a model’s inference time or fairness, are often crucial for real-world applications. Multi-objective optimization can be used to convey the importance of these metrics to AutoML frameworks (for example, by using a fairness metric as a secondary objective; Schmucker et al., 2021), which can subsequently provide Pareto fronts of models that optimize this trade-off. In particular, for fairness-related tasks, additional support to convey sensitive attributes and protected groups to the AutoML frameworks must be added. Other interesting problem types include non-i.i.d. data, such as when temporal relationships are present in the data, or semi-supervised learning, where not all instances have an associated ground truth.

Finally, the evaluations in this paper were performed on the equivalent of commodity-level hardware, using only CPUs and a limited time budget. In some cases, it is more desirable to devote a large budget to building a single model, perhaps even days of compute, potentially with GPU access. As different behavior (both in robustness and model performance) is already observed between the one-hour and four-hour budgets, future work may reveal different behavior when evaluating AutoML frameworks at different scales.

7.3 Parting Words

The benchmark tool presented in this work makes producing rigorous reproducible research both easier and faster. We hope that the open and extensible nature of this benchmark motivates researchers to not only use the tool, but also to contribute their own data sets, framework integrations, or feedback to the open source AutoML benchmark. We strongly encourage this participation so that the benchmark may remain useful to the community for a long time to come.

We would all like to give special thanks to everyone that contributed to the benchmark, both directly with pull requests and indirectly through opening issues. We also thank Rinchin Damdinov, Nick Erickson, Matthias Feurer, and Piotr Płoński for feedback and corrections to this manuscript.

This work made use of the resources and expertise offered by the SURF Public Cloud Call which is financed by the Dutch Research Council (NWO). We also made use of research credits provided by the AWS Cloud Credit for Research program.

Pieter Gijsbers and Joaquin Vanschoren would like to acknowledge funding by AFRL and DARPA under contract FA8750-17-C-0141, and EU’s Horizon 2020 research and innovation program under grant agreement No. 952215 (TAILOR).

Stefan Coors, Janek Thomas, and Bernd Bischl would like to acknowledge funding by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A.

Appendix A OpenML Benchmark Suites

Table 2 and Table 3 contain an overview of data sets used in the regression and classification benchmarking suites, respectively. We hope to continuously update the benchmarking suites with new data sets that represent current challenges.

Task ID name n p
359944 abalone 4177 9
359929 Airlines_DepDelay_10M 10000000 10
233212 Allstate_Claims_Severity 188318 131
359937 black_friday 166821 10
359950 boston 506 14
359938 Brazilian_houses 10692 13
233213 Buzzinsocialmedia_Twitter 583250 78
359942 colleges 7063 45
233211 diamonds 53940 10
359936 elevators 16599 19
359952 house_16H 22784 17
359951 house_prices_nominal 1460 80
359949 house_sales 21613 22
233215 Mercedes_Benz_Greener_Manufacturing 4209 377
360945 MIP-2016-regression 1090 145
167210 Moneyball 1232 15
359943 nyc-taxi-green-dec-2016 581835 19
359941 OnlineNewsPopularity 39644 60
359946 pol 15000 49
360933 QSAR-TID-10980 5766 1026
360932 QSAR-TID-11 5742 1026
359930 quake 2178 4
233214 Santander_transaction_value 4459 4992
359948 SAT11-HAND-runtime-regression 4440 117
359931 sensory 576 12
359932 socmob 1156 6
359933 space_ga 3107 7
359934 tecator 240 125
359939 topo_2_1 8885 267
359945 us_crime 1994 127
359935 wine_quality 6497 12
317614 Yolanda 400000 101
359940 yprop_4_1 8885 252
Table 2: Tasks in the AutoML regression suite (n: number of instances, p: number of features).
Task ID name n p C class ratio
190411 ada 4147 49 2 0.33
359983 adult 48842 15 2 0.31
189354 airlines 539383 8 2 0.80
189356 albert 425240 79 2 1.00
10090 amazon-commerce-reviews 1500 10001 50 1.00
359979 Amazon_employee_access 32769 10 2 0.06
168868 APSFailure 76000 171 2 0.02
190412 arcene 100 10001 2 0.79
146818 Australian 690 15 2 0.80
359982 bank-marketing 45211 17 2 0.13
359967 Bioresponse 3751 1777 2 0.84
359955 blood-transfusion-service-center 748 5 2 0.31
359960 car 1728 7 4 0.05
359973 christine 5418 1637 2 1.00
359968 churn 5000 21 2 0.16
359992 Click_prediction_small 39948 12 2 0.20
359959 cmc 1473 10 3 0.53
359957 cnae-9 1080 857 9 1.00
359977 connect-4 67557 43 3 0.15
7593 covertype 581012 55 7 0.01
168757 credit-g 1000 21 2 0.43
211986 Diabetes130US 101766 50 3 0.21
168909 dilbert 10000 2001 5 0.93
189355 dionis 416188 61 355 0.36
359964 dna 3186 181 3 0.46
359954 eucalyptus 736 20 5 0.49
168910 fabert 8237 801 7 0.26
359976 Fashion-MNIST 70000 785 10 1.00
359969 first-order-theorem-proving 6118 52 6 0.19
359970 GesturePhaseSegmentationProcessed 9873 33 5 0.34
189922 gina 3153 971 2 0.97
359988 guillermo 20000 4297 2 0.67
359984 helena 65196 28 100 0.03
360114 Higgs 1000000 29 2 0.89
359966 Internet-Advertisements 3279 1559 2 0.16
211979 jannis 83733 55 4 0.04
168911 jasmine 2984 145 2 1.00
359981 jungle_chess_2pcs_raw_endgame_complete 44819 7 3 0.19
359962 kc1 2109 22 2 0.18
360975 KDDCup09-Upselling 50000 14892 2 0.08
3945 KDDCup09_appetency 50000 231 2 0.02
360112 KDDCup99 4898431 42 23 0.00
359991 kick 72983 33 2 0.14
359965 kr-vs-kp 3196 37 2 0.91
190392 madeline 3140 260 2 0.99
359961 mfeat-factors 2000 217 10 1.00
359953 micro-mass 571 1301 20 0.18
359990 MiniBooNE 130064 51 2 0.39
359980 nomao 34465 119 2 0.40
167120 numerai28.6 96320 22 2 0.98
359993 okcupid-stem 50789 20 3 0.13
190137 ozone-level-8hr 2534 73 2 0.07
359958 pc4 1458 38 2 0.14
190410 philippine 5832 309 2 1.00
359971 PhishingWebsites 11055 31 2 0.80
168350 phoneme 5404 6 2 0.42
360113 porto-seguro 595212 58 2 0.04
359956 qsar-biodeg 1055 42 2 0.51
359989 riccardo 20000 4297 2 0.33
359986 robert 10000 7201 10 0.92
359975 Satellite 5100 37 2 0.01
359963 segment 2310 20 7 1.00
359994 sf-police-incidents 2215023 9 2 0.14
359987 shuttle 58000 10 7 0.00
168784 steel-plates-fault 1941 28 7 0.08
359972 sylvine 5124 21 2 1.00
190146 vehicle 846 19 4 0.91
359985 volkert 58310 181 10 0.11
146820 wilt 4839 6 2 0.06
359974 wine-quality-white 4898 12 7 0.00
2073 yeast 1484 9 10 0.01
Table 3: Tasks in the AutoML classification suite (n: number of instances, p: number of features, C: number of classes, class ratio: size of the smallest class relative to the largest).

Appendix B Results

This appendix contains additional figures and tables with experimental results. Tables 4-9 report results per framework and task for different task types and time budgets. Each value denotes the mean score over the completed folds, the standard deviation over those folds, and, if applicable, the number of folds for which the AutoML framework did not return a result. A ‘-’ denotes cases where the AutoML framework was unable to complete any fold of a task on a specific time budget. The baseline performances are omitted to allow for single-page tables, but are available in the published experimental result files.
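A sketch of this aggregation, assuming a hypothetical long-format DataFrame of fold-level scores where NaN marks folds that did not return a result:

```python
import pandas as pd

def aggregate_folds(folds: pd.DataFrame, n_folds: int = 10) -> pd.DataFrame:
    """folds has columns ['framework', 'task', 'fold', 'score']; NaN = missing result."""
    grouped = folds.groupby(["framework", "task"])["score"]
    table = grouped.agg(mean="mean", std="std", completed="count")
    table["missing_folds"] = n_folds - table["completed"]
    return table
```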

framework AUTOGLUON AUTO-SKLEARN AUTO-SKLEARN 2 FLAML GAMA H2O AUTOML LIGHT AUTOML MLJAR TPOT
task id task name
146818 australi… 0.941(0.019) 0.932(0.020) 0.937(0.020) 0.942(0.017) 0.941(0.019) 0.934(0.024) 0.944(0.022) 0.945(0.018) 0.937(0.021)
146820 wilt 0.994(0.010) 0.997(0.003) 0.994(0.008) 0.992(0.012) 0.995(0.005) 0.994(0.007) 0.994(0.007) 0.996(0.004) 0.994(0.007)
167120 numerai2… 0.525(0.005) 0.529(0.005) 0.530(0.004) 0.528(0.004) 0.531(0.004) 0.532(0.005) 0.531(0.005) 0.530(0.004) 0.527(0.006)
168350 phoneme 0.972(0.008) 0.965(0.009) 0.970(0.008) 0.972(0.010) 0.971(0.009) 0.967(0.009) 0.965(0.008) 0.976(0.004) 0.969(0.010)
168757 credit-g 0.791(0.042) 0.777(0.049) 0.797(0.037) 0.780(0.045) 0.794(0.028) 0.798(0.033) 0.791(0.037) - 0.791(0.039)
168868 apsfailu… 0.992(0.003) 0.992(0.003) 0.992(0.002) 0.991(0.003) 0.990(0.003) 0.992(0.002) 0.992(0.002) 0.993(0.002) 0.990(0.003)
168911 jasmine 0.887(0.017) 0.884(0.017) 0.887(0.016) 0.885(0.018) 0.894(0.016) 0.884(0.018) 0.880(0.018) 0.887(0.016) 0.886(0.015)
189354 airlines 0.728(0.002) 0.726(0.002) 0.728(0.002) 0.731(0.002) 0.724(0.005) 0.731(0.002) 0.728(0.002) 0.730(0.002) 0.724(0.002)
189356 albert 0.781(0.003) 0.760(0.004) 0.756(0.003) 0.778(0.004) 0.716(0.012) 0.763(0.003) 0.782(0.002) 0.783(0.002) 0.725(0.015)
189922 gina 0.992(0.005) 0.989(0.007) 0.987(0.007) 0.992(0.005) 0.992(0.003) 0.990(0.006) 0.990(0.006) 0.991(0.005) 0.988(0.007)
190137 ozone-le… 0.935(0.018) 0.920(0.024) 0.928(0.024) 0.925(0.025) 0.924(0.030) 0.929(0.018) 0.931(0.016) 0.930(0.022) 0.913(0.037)
190392 madeline 0.945(0.009) 0.969(0.005) 0.944(0.008) 0.952(0.011) 0.956(0.007) 0.944(0.011) 0.935(0.008) 0.953(0.011) 0.948(0.007)
190410 philippi… 0.883(0.013) 0.911(0.013) 0.875(0.014) 0.893(0.013) 0.895(0.012) 0.878(0.012) 0.864(0.015) 0.881(0.013) 0.888(0.014)
190411 ada 0.920(0.018) 0.919(0.017) 0.919(0.018) 0.924(0.017) 0.920(0.018) 0.920(0.018) 0.922(0.018) 0.920(0.017) 0.918(0.018)
190412 arcene 0.861(0.170) 0.865(0.181) 0.845(0.202) 0.848(0.204) 0.860(0.162) 0.853(0.199) 0.865(0.133) 0.837(0.206) 0.853(0.133)
359955 blood-tr… 0.752(0.045) 0.758(0.045) 0.759(0.044) 0.734(0.053) 0.756(0.048) 0.763(0.040) 0.748(0.054) - 0.757(0.055)
359956 qsar-bio… 0.942(0.034) 0.926(0.033) 0.939(0.027) 0.930(0.035) 0.936(0.032) 0.936(0.034) 0.933(0.033) 0.931(0.036) 0.933(0.032)
359958 pc4 0.951(0.018) 0.943(0.018) 0.948(0.017) 0.948(0.021) 0.952(0.021) 0.947(0.026) 0.948(0.016) 0.951(0.017) 0.951(0.018)
359962 kc1 0.839(0.034) 0.842(0.028) 0.838(0.035) 0.842(0.035) 0.849(0.030) 0.823(0.042) 0.827(0.034) 0.828(0.033) 0.853(0.036)
359965 kr-vs-kp 1.000(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.001) 1.000(0.000) 0.999(0.001) 0.999(0.001)
359966 internet… 0.987(0.009) 0.985(0.013) 0.986(0.012) 0.985(0.011) 0.985(0.010) 0.986(0.009) 0.987(0.010) 0.986(0.012) 0.984(0.010)
359967 biorespo… 0.886(0.016) 0.872(0.017) 0.872(0.018) 0.886(0.016) 0.884(0.017) 0.886(0.017) 0.884(0.017) 0.884(0.018) 0.879(0.017)
359968 churn 0.928(0.024) 0.922(0.022) 0.920(0.026) 0.920(0.018) 0.923(0.023) 0.922(0.027) 0.925(0.021) 0.932(0.025) 0.918(0.021)
359971 phishing… 0.998(0.001) 0.997(0.001) 0.997(0.001) 0.998(0.001) 0.997(0.001) 0.998(0.001) 0.998(0.001) 0.998(0.001) 0.948(0.157)
359972 sylvine 0.991(0.003) 0.990(0.004) 0.989(0.002) 0.991(0.002) 0.993(0.002) 0.990(0.003) 0.988(0.003) 0.993(0.003) 0.991(0.003)
359973 christine 0.828(0.012) 0.831(0.015) 0.817(0.012) 0.822(0.013) 0.829(0.018) 0.824(0.011) 0.832(0.013) 0.825(0.014) 0.794(0.048)
359975 satellite 0.997(0.004) 0.993(0.007) 0.995(0.008) 0.993(0.007) 0.997(0.003) 0.994(0.006) 0.985(0.026) 0.994(0.007) 0.997(0.002)
359979 amazon_e… 0.899(0.012) 0.853(0.015) 0.876(0.012) 0.901(0.012) 0.863(0.013) 0.879(0.010) 0.902(0.011) 0.905(0.013) 0.865(0.012)
359980 nomao 0.997(0.001) 0.996(0.001) 0.996(0.001) 0.997(0.001) 0.996(0.001) 0.996(0.001) 0.998(0.001) 0.996(0.001) 0.993(0.008)
359982 bank-mar… 0.941(0.006) 0.939(0.006) 0.939(0.007) 0.938(0.007) 0.937(0.006) 0.939(0.007) 0.940(0.005) 0.940(0.004) 0.933(0.007)
359983 adult 0.932(0.003) 0.930(0.004) 0.931(0.004) 0.931(0.004) 0.929(0.004) 0.931(0.004) 0.933(0.004) - 0.927(0.005)
359988 guillermo 0.925(0.007) 0.913(0.008) 0.904(0.008) 0.921(0.010) 0.875(0.026) 0.902(0.010) 0.939(0.006) 0.916(0.009) 0.811(0.032)
359989 riccardo 1.000(0.000) 1.000(0.000) 1.000(0.000) 0.999(0.002) 0.999(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.000) 0.998(0.001)
359990 miniboone 0.989(0.001) 0.987(0.001) 0.988(0.001) 0.987(0.001) 0.982(0.002) 0.987(0.001) 0.988(0.001) 0.987(0.001) 0.981(0.002)
359991 kick 0.788(0.007) 0.786(0.006) 0.785(0.006) 0.787(0.007) 0.786(0.007) 0.788(0.007) 0.783(0.007) 0.752(0.009) 0.728(0.012)
359992 click_pr… 0.700(0.010) 0.700(0.012) 0.705(0.012) 0.722(0.009) 0.660(0.016) 0.702(0.014) 0.728(0.010) 0.709(0.014) 0.716(0.012)
359994 sf-polic… 0.698(0.001) 0.695(0.006) 0.705(0.003) 0.712(0.008) 0.623(0.009) 0.700(0.002) 0.688(0.002) 0.706(0.009) 0.654(0.020)
360113 porto-se… 0.641(0.004) 0.638(0.005) 0.638(0.004) 0.642(0.004) 0.624(0.007) 0.642(0.004) 0.641(0.004) 0.642(0.004) 0.626(0.007)
360114 higgs 0.839(0.001) 0.832(0.003) 0.835(0.002) 0.840(0.001) 0.786(0.008) 0.832(0.001) 0.838(0.001) 0.836(0.001) 0.773(0.003)
360975 kddcup09… 0.902(0.007) 0.892(0.009) 0.893(0.009) - - 0.898(0.008) 0.910(0.007) 0.907(0.006) -
3945 kddcup09… 0.846(0.013) 0.838(0.015) 0.841(0.015) 0.829(0.010) 0.830(0.014) 0.838(0.015) 0.851(0.008) 0.841(0.016) 0.823(0.015)
Table 4: Results for binary classification (in AUC) on a one hour budget, denoted as mean(std).
framework AUTOGLUON AUTO-SKLEARN AUTO-SKLEARN 2 FLAML GAMA H2O AUTOML LIGHT AUTOML MLJAR TPOT
task id task name
146818 australi… 0.940(0.020) 0.932(0.019) 0.940(0.020) 0.939(0.025) 0.940(0.019) 0.934(0.020) 0.944(0.021) 0.940(0.024) 0.936(0.024)
146820 wilt 0.994(0.009) 0.994(0.010) 0.995(0.008) 0.988(0.013) 0.996(0.004) 0.993(0.009) 0.994(0.007) 0.994(0.003) 0.985(0.025)
167120 numerai2… 0.524(0.005) 0.530(0.005) 0.531(0.004) 0.528(0.005) 0.532(0.004) 0.531(0.004) 0.531(0.005) 0.530(0.004) 0.527(0.006)
168350 phoneme 0.972(0.008) 0.964(0.008) 0.970(0.009) 0.972(0.009) 0.971(0.009) 0.967(0.010) 0.966(0.008) - 0.971(0.009)
168757 credit-g 0.791(0.039) 0.783(0.042) 0.795(0.038) 0.784(0.039) 0.791(0.030) 0.782(0.043) 0.788(0.035) - 0.787(0.034)
168868 apsfailu… 0.992(0.002) 0.992(0.002) 0.992(0.003) 0.992(0.003) 0.992(0.002) 0.992(0.002) 0.994(nan) 0.993(0.002) 0.989(0.003)
168911 jasmine 0.887(0.018) 0.882(0.014) 0.887(0.017) 0.888(0.016) 0.893(0.014) 0.882(0.020) 0.881(0.018) 0.891(0.016) 0.889(0.012)
189354 airlines 0.730(0.002) 0.728(0.002) 0.727(0.002) 0.731(0.002) - 0.733(0.002) 0.730(0.002) 0.732(0.002) 0.724(0.002)
189356 albert 0.781(0.002) 0.764(0.004) 0.759(0.002) 0.776(0.005) 0.747(0.009) 0.769(0.002) 0.782(0.002) 0.785(0.002) 0.734(0.009)
189922 gina 0.992(0.005) 0.994(0.003) 0.988(0.007) 0.991(0.005) 0.991(0.005) 0.990(0.005) 0.990(0.006) 0.993(0.004) 0.991(0.005)
190137 ozone-le… 0.934(0.017) 0.920(0.024) 0.933(0.022) 0.925(0.021) 0.926(0.032) 0.930(0.016) 0.930(0.016) 0.911(0.019) 0.916(0.026)
190392 madeline 0.946(0.009) 0.968(0.006) 0.945(0.008) 0.954(0.007) 0.959(0.008) 0.948(0.011) 0.935(0.009) 0.963(0.008) 0.954(0.007)
190410 philippi… 0.884(0.013) 0.917(0.013) 0.877(0.014) 0.893(0.013) 0.903(0.014) 0.878(0.013) 0.865(0.015) 0.905(0.010) 0.897(0.013)
190411 ada 0.920(0.018) 0.917(0.017) 0.920(0.018) 0.924(0.018) 0.921(0.018) 0.921(0.017) 0.922(0.018) 0.921(0.018) 0.917(0.018)
190412 arcene 0.857(0.175) 0.861(0.140) 0.832(0.156) 0.843(0.200) 0.856(0.162) 0.844(0.170) 0.857(0.176) 0.864(0.156) 0.840(0.135)
359955 blood-tr… 0.755(0.044) 0.745(0.052) 0.755(0.040) 0.731(0.066) 0.757(0.049) 0.760(0.029) 0.749(0.055) - 0.754(0.043)
359956 qsar-bio… 0.941(0.035) 0.929(0.036) 0.937(0.027) 0.928(0.033) 0.937(0.032) 0.937(0.037) 0.933(0.033) 0.926(nan) 0.933(0.031)
359958 pc4 0.951(0.018) 0.941(0.020) 0.949(0.017) 0.949(0.019) 0.951(0.019) 0.945(0.022) 0.950(0.016) 0.951(0.017) 0.943(0.023)
359962 kc1 0.839(0.033) 0.843(0.031) 0.839(0.036) 0.840(0.035) 0.851(0.032) 0.831(0.029) 0.828(0.032) 0.829(0.032) 0.844(0.036)
359965 kr-vs-kp 1.000(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.000) 0.950(0.158)
359966 internet… 0.987(0.011) 0.983(0.014) 0.982(0.015) 0.986(0.008) 0.984(0.011) 0.988(0.009) 0.987(0.010) 0.991(nan) 0.982(0.011)
359967 biorespo… 0.887(0.017) 0.871(0.018) 0.873(0.018) 0.886(0.017) 0.885(0.017) 0.889(0.015) 0.883(0.016) 0.885(0.018) 0.880(0.017)
359968 churn 0.929(0.023) 0.920(0.022) 0.919(0.021) 0.921(0.020) 0.921(0.022) 0.926(0.020) 0.926(0.022) 0.931(0.024) 0.919(0.022)
359971 phishing… 0.997(0.001) 0.997(0.001) 0.997(0.001) 0.998(0.001) 0.998(0.001) 0.998(0.001) 0.998(0.001) - 0.849(0.240)
359972 sylvine 0.991(0.003) 0.992(0.004) 0.990(0.002) 0.991(0.002) 0.993(0.002) 0.991(0.004) 0.988(0.003) 0.993(0.003) 0.995(0.001)
359973 christine 0.829(0.013) 0.829(0.017) 0.818(0.013) 0.826(0.012) 0.833(0.014) 0.824(0.013) 0.832(0.013) 0.829(0.012) 0.816(0.013)
359975 satellite 0.997(0.003) 0.979(0.047) 0.995(0.005) 0.981(0.030) 0.996(0.002) 0.991(0.010) 0.985(0.024) 0.989(0.015) 0.990(0.023)
359979 amazon_e… 0.895(0.012) 0.862(0.015) 0.878(0.010) 0.901(0.012) 0.862(0.013) 0.877(0.013) 0.903(0.011) 0.904(0.012) 0.866(0.013)
359980 nomao 0.997(0.001) 0.996(0.001) 0.997(0.001) 0.997(0.001) 0.996(0.001) 0.996(0.001) 0.998(0.001) 0.997(0.001) 0.996(0.001)
359982 bank-mar… 0.942(0.006) 0.938(0.006) 0.939(0.007) 0.938(0.007) 0.937(0.007) 0.939(0.007) 0.940(nan) 0.943(0.005) 0.935(0.007)
359983 adult 0.932(0.004) 0.930(0.004) 0.931(0.004) 0.932(0.004) 0.930(0.004) 0.931(0.004) 0.933(0.004) - 0.928(0.004)
359988 guillermo 0.930(0.006) 0.913(0.008) 0.907(0.009) - 0.908(0.010) 0.911(0.008) 0.940(0.007) 0.917(0.007) 0.859(0.048)
359989 riccardo 1.000(0.000) 1.000(0.000) 1.000(0.000) 0.999(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.000) 0.997(0.004)
359990 miniboone 0.989(0.001) 0.987(0.001) 0.988(0.001) 0.987(0.001) 0.985(0.001) 0.987(0.001) 0.988(0.001) 0.988(0.001) 0.983(0.001)
359991 kick 0.787(0.007) 0.790(0.007) 0.786(0.007) 0.788(0.006) 0.788(0.006) 0.788(0.007) 0.784(0.007) 0.757(0.012) 0.742(0.006)
359992 click_pr… 0.698(0.009) 0.698(0.014) 0.703(0.012) 0.723(0.009) 0.660(0.015) 0.704(0.012) 0.728(0.009) - 0.719(0.010)
359994 sf-polic… 0.725(0.002) 0.708(0.003) 0.706(0.002) 0.713(0.007) 0.648(0.012) 0.707(0.004) 0.690(0.002) 0.708(0.003) 0.670(0.010)
360113 porto-se… 0.643(0.004) 0.639(0.004) 0.640(0.004) 0.642(0.005) 0.633(0.005) 0.642(0.004) 0.641(0.004) 0.643(0.004) 0.631(0.005)
360114 higgs 0.843(0.001) 0.841(0.001) 0.842(0.003) 0.841(0.001) 0.806(0.008) 0.834(0.001) 0.839(0.001) 0.838(0.002) 0.777(0.007)
360975 kddcup09… 0.909(0.008) 0.886(0.005) - - - 0.904(0.008) 0.909(0.004) 0.909(0.006) -
3945 kddcup09… 0.846(0.013) 0.837(0.015) 0.842(0.016) 0.836(0.015) 0.831(0.015) 0.837(0.015) 0.840(nan) 0.835(0.021) 0.830(0.016)
Table 5: Results for binary classification (in AUC) on a four hour budget, denoted as mean(std).
framework AUTOGLUON AUTO-SKLEARN AUTO-SKLEARN 2 FLAML GAMA H2O AUTOML LIGHT AUTOML MLJAR TPOT
task id task name
10090 amazon-c… 0.722(0.094) 0.852(0.164) 0.842(0.143) 1.122(0.179) 0.900(0.080) 1.196(0.209) 0.844(0.097) 1.211(0.163) 1.169(0.294)
168784 steel-pl… 0.465(0.041) 0.530(0.029) 0.472(0.031) 0.509(0.047) 0.486(0.033) 0.484(0.039) 0.488(0.027) 0.464(0.027) 0.511(0.041)
168909 dilbert 0.012(0.004) 0.033(0.012) 0.052(0.020) 0.024(0.010) 0.169(0.031) 0.044(0.007) 0.033(0.006) 0.028(0.010) 0.166(0.103)
168910 fabert 0.686(0.027) 0.744(0.029) 0.745(0.024) 0.766(0.025) 0.753(0.020) 0.729(0.031) 0.771(0.031) 0.758(0.032) 0.857(0.041)
189355 dionis 0.273(0.004) 1.138(0.273) 0.594(0.071) 0.374(0.004) 1.591(0.348) 3.351(0.120) - 1.198(0.421) 17.230(nan)
190146 vehicle 0.296(0.052) 0.365(0.044) 0.342(0.041) 0.445(0.041) 0.357(0.034) 0.328(0.062) 0.369(0.061) 0.328(0.031) 0.377(0.072)
2073 yeast 1.012(0.084) 1.038(0.079) 1.011(0.083) 1.012(0.079) 1.020(0.080) 1.058(0.094) 1.039(0.091) 1.007(0.088) 1.017(0.081)
211979 jannis 0.650(0.005) 0.665(0.006) 0.672(0.007) 0.676(0.009) 0.726(0.012) 0.669(0.006) 0.665(0.005) 0.663(0.005) 0.732(0.011)
211986 diabetes… 0.833(0.005) 0.835(0.006) 0.834(0.006) 0.833(0.006) 0.844(0.006) 0.833(0.006) 0.763(0.007) 0.829(0.006) 0.848(0.007)
359953 micro-ma… 0.257(0.084) 0.291(0.138) 0.210(0.076) 0.329(0.109) 0.242(0.113) 0.387(0.135) 0.272(0.072) 0.432(0.228) 0.375(0.139)
359954 eucalypt… 0.690(0.053) 0.742(0.075) 0.692(0.053) 0.728(0.058) 0.698(0.053) 0.689(0.052) 0.703(0.064) 0.648(0.047) 0.696(0.064)
359957 cnae-9 0.137(0.066) 0.176(0.076) 0.146(0.049) 0.165(0.049) 0.125(0.044) 0.162(0.081) 0.152(0.058) 0.207(0.097) 0.155(0.067)
359959 cmc 0.916(0.055) 0.883(0.038) 0.882(0.041) 0.897(0.040) 0.893(0.043) 0.902(0.044) 0.885(0.046) 0.893(0.056) 0.918(0.059)
359960 car 0.005(0.014) 0.003(0.005) 0.001(0.002) 0.003(0.002) 0.015(0.009) 0.002(0.003) 0.002(0.002) 0.063(0.188) 0.812(1.332)
359961 mfeat-fa… 0.069(0.028) 0.092(0.038) 0.074(0.031) 0.086(0.041) 0.078(0.030) 0.095(0.054) 0.085(0.029) 0.101(0.040) 0.113(0.066)
359963 segment 0.141(0.039) 0.178(0.039) 0.152(0.031) 0.175(0.062) 0.149(0.031) 0.157(0.040) 0.161(0.036) 0.156(0.036) 0.164(0.038)
359964 dna 0.106(0.028) 0.119(0.032) 0.111(0.026) 0.111(0.030) 0.106(0.028) 0.111(0.029) 0.109(0.025) 0.113(0.027) 0.117(0.028)
359969 first-or… 1.037(0.041) 1.104(0.032) 1.047(0.032) 1.040(0.031) 1.057(0.029) 1.042(0.033) 1.048(0.024) 1.035(0.028) 1.072(0.021)
359970 gesturep… 0.653(0.032) 0.803(0.026) 0.774(0.039) 0.773(0.036) 0.818(0.031) 0.717(0.039) 0.755(0.038) 0.722(0.034) 0.860(0.055)
359974 wine-qua… 0.700(0.027) 0.797(0.041) 0.723(0.032) 0.732(0.042) 0.755(0.020) 0.751(0.029) 0.813(0.013) 0.772(0.033) 0.808(0.025)
359976 fashion-… 0.237(0.007) 0.253(0.008) 0.263(0.010) 0.259(0.021) 0.400(0.017) 0.278(0.008) 0.250(0.008) 0.249(0.009) 0.515(0.113)
359977 connect-4 0.294(0.007) 0.355(0.013) 0.351(0.027) 0.347(0.007) 0.384(0.031) 0.309(0.007) 0.337(0.007) 0.323(0.006) 0.399(0.040)
359981 jungle_c… 0.011(0.002) 0.186(0.036) 0.233(0.022) 0.210(0.006) 0.240(0.015) 0.171(0.025) 0.148(0.018) 0.082(0.011) 0.552(1.073)
359984 helena 2.464(0.014) 2.554(0.019) 2.493(0.019) 2.584(0.022) 2.786(0.011) 2.781(0.020) 2.554(0.017) 2.601(0.032) 2.981(0.095)
359985 volkert 0.691(0.015) 0.790(0.016) 0.817(0.019) 0.834(0.074) 1.025(0.016) 0.833(0.014) 0.832(0.013) 0.793(0.015) 1.006(0.023)
359986 robert 1.444(0.014) 1.404(0.042) 1.476(0.048) 1.377(0.028) 1.697(0.058) 1.507(0.064) 1.317(0.021) 1.342(0.035) 2.017(0.161)
359987 shuttle 0.000(0.000) 0.000(0.000) 0.000(0.000) 0.000(0.001) 0.001(0.000) 0.000(0.001) 0.001(0.000) 0.000(0.000) 0.001(0.000)
359993 okcupid-… 0.559(0.009) 0.568(0.006) 0.567(0.008) 0.565(0.008) 0.568(0.007) 0.565(0.011) 0.560(0.009) 0.564(0.008) 0.572(0.008)
360112 kddcup99 0.002(0.000) 0.001(0.002) 0.000(0.000) 0.000(0.000) - - - 0.000(0.000) -
7593 covertype 0.065(0.001) 0.141(0.013) 0.113(0.005) 0.068(0.002) 0.529(0.039) 0.111(0.003) 0.085(0.001) 0.084(0.008) 0.537(0.095)
Table 6: Results for multiclass classification (in logloss) on a one hour budget, denoted as mean(std).
framework AUTOGLUON AUTO-SKLEARN AUTO-SKLEARN 2 FLAML GAMA H2O AUTOML LIGHT AUTOML MLJAR TPOT
task id task name
10090 amazon-c… 0.635(0.058) 0.809(0.122) 0.837(0.122) 1.144(0.163) 0.907(0.094) 1.172(0.167) 0.808(0.062) 1.181(0.132) 0.852(0.159)
168784 steel-pl… 0.466(0.041) 0.512(0.028) 0.472(0.029) 0.505(0.052) 0.491(0.039) 0.490(0.042) 0.488(0.033) 0.467(0.032) 0.486(0.021)
168909 dilbert 0.012(0.004) 0.033(0.012) 0.029(0.008) 0.026(0.009) 0.115(0.044) 0.023(0.005) 0.032(0.006) 0.024(0.009) 0.060(0.022)
168910 fabert 0.682(0.028) 0.756(0.035) 0.733(0.026) 0.762(0.025) 0.737(0.029) 0.728(0.031) 0.781(0.034) 0.752(0.025) 0.795(0.049)
189355 dionis 0.248(0.004) 0.491(0.037) 0.523(0.144) - 1.587(0.244) 1.469(0.127) - - -
190146 vehicle 0.298(0.050) 0.368(0.055) 0.329(0.030) 0.442(0.049) 0.369(0.032) 0.331(0.062) 0.397(0.066) 0.321(0.043) 0.339(0.065)
2073 yeast 1.015(0.087) 1.043(0.080) 1.015(0.084) 1.011(0.083) 1.019(0.081) 1.040(0.091) 1.038(0.094) 1.004(0.085) 1.029(0.083)
211979 jannis 0.647(0.006) 0.666(0.010) 0.672(0.005) 0.675(0.011) 0.698(0.009) 0.665(0.006) 0.665(0.005) 0.658(0.005) 0.715(0.011)
211986 diabetes… 0.831(0.006) 0.834(0.005) 0.832(0.005) 0.832(0.006) 0.837(0.006) 0.833(0.006) 0.762(0.008) 0.828(0.006) 0.843(0.005)
359953 micro-ma… 0.252(0.088) 0.271(0.093) 0.189(0.073) 0.307(0.129) 0.223(0.092) 0.329(0.152) 0.284(0.112) 0.460(0.202) 0.289(0.153)
359954 eucalypt… 0.690(0.053) 0.716(0.047) 0.704(0.061) 0.779(0.121) 0.700(0.057) 0.702(0.087) 0.695(0.058) 0.646(0.054) 0.752(0.130)
359957 cnae-9 0.137(0.068) 0.178(0.076) 0.143(0.043) 0.139(0.048) 0.132(0.044) 0.164(0.103) 0.149(0.058) 0.176(0.057) 0.150(0.077)
359959 cmc 0.920(0.057) 0.889(0.043) 0.884(0.037) 0.899(0.045) 0.893(0.043) 0.898(0.043) 0.887(0.044) 0.888(0.054) 0.908(0.060)
359960 car 0.004(0.011) 0.004(0.008) 0.002(0.004) 0.003(0.005) 0.012(0.008) 0.001(0.001) 0.002(0.002) 0.002(0.003) 1.450(3.004)
359961 mfeat-fa… 0.067(0.028) 0.093(0.033) 0.074(0.030) 0.093(0.042) 0.082(0.028) 0.098(0.042) 0.080(0.029) 0.102(0.029) 0.108(0.042)
359963 segment 0.054(0.024) 0.084(0.031) 0.062(0.026) 0.079(0.041) 0.067(0.026) 0.159(0.040) 0.061(0.021) 0.058(0.021) 0.071(0.032)
359964 dna 0.106(0.027) 0.116(0.032) 0.111(0.025) 0.106(0.029) 0.106(0.028) 0.109(0.030) 0.109(0.026) 0.109(0.025) 0.112(0.025)
359969 first-or… 1.039(0.038) 1.103(0.035) 1.041(0.030) 1.037(0.027) 1.052(0.027) 1.049(0.039) 1.046(0.026) 1.032(0.029) 1.062(0.035)
359970 gesturep… 0.652(0.033) 0.807(0.021) 0.768(0.029) 0.763(0.028) 0.807(0.039) 0.762(0.032) 0.757(0.039) 0.720(0.035) 0.834(0.044)
359974 wine-qua… 0.698(0.027) 0.793(0.038) 0.715(0.028) 0.726(0.050) 0.772(0.018) 0.757(0.033) 0.791(0.021) 0.755(0.030) 0.787(0.021)
359976 fashion-… 0.217(0.009) 0.242(0.009) 0.247(0.012) 0.257(0.032) 0.356(0.009) 0.253(0.009) 0.250(0.008) 0.245(0.008) 0.415(0.029)
359977 connect-4 0.293(0.007) 0.348(0.008) 0.342(0.013) 0.339(0.005) 0.346(0.028) 0.318(0.028) 0.321(0.006) 0.316(0.007) 0.378(0.023)
359981 jungle_c… 0.006(0.002) 0.171(0.045) 0.203(0.021) 0.211(0.005) 0.217(0.022) 0.107(0.011) 0.102(0.009) 0.073(0.006) 1.078(2.885)
359984 helena 2.470(0.016) 2.526(0.018) 2.485(0.031) 2.564(0.019) 2.731(nan) 2.794(0.018) 2.504(0.014) 2.575(0.021) 2.922(0.039)
359985 volkert 0.674(0.009) 0.780(0.022) 0.833(0.038) 0.812(0.011) 0.951(0.016) 0.781(0.012) 0.781(0.011) 0.794(0.013) 0.973(0.040)
359986 robert 1.265(0.030) 1.324(0.027) 1.383(0.031) 1.307(0.012) 1.617(0.030) 1.385(0.025) 1.276(0.024) 1.306(0.029) 1.809(0.114)
359987 shuttle 0.000(0.000) 0.000(0.001) 0.000(0.000) 0.000(0.000) 0.000(0.000) 0.000(0.001) 0.001(0.000) 0.000(0.000) 0.000(0.000)
359993 okcupid-… 0.559(0.009) 0.567(0.007) 0.563(0.008) 0.562(0.008) 0.568(0.007) 0.567(0.008) 0.560(0.009) 0.563(0.008) 0.569(0.009)
360112 kddcup99 0.002(0.000) 0.000(0.000) 0.000(0.000) 0.000(0.000) - - - 0.000(0.000) -
7593 covertype 0.057(0.001) 0.097(0.022) 0.095(0.010) 0.067(0.003) 0.255(0.048) 0.086(0.002) 0.071(0.002) 0.083(0.006) 0.459(0.139)
Table 7: Results for multiclass classification (in logloss) on a four hour budget, denoted as mean(std).
framework AUTOGLUON AUTO-SKLEARN FLAML GAMA H2O AUTOML LIGHT AUTOML MLJAR TPOT
task id task name
167210 moneyball 21(0.85) 21(0.61) 22(0.83) 21(0.79) 22(2.2) 21(0.71) 21(0.84) 21(0.85)
233211 diamonds 5.1e+02(20) 5.2e+02(20) 5.2e+02(19) 5.2e+02(22) 5.1e+02(18) 5.2e+02(22) 5.1e+02(22) 5.4e+02(13)
233212 allstate… 1.9e+03(41) 1.9e+03(60) 1.9e+03(51) 2e+03(72) 1.9e+03(47) 1.9e+03(51) 1.9e+03(56) 2e+03(62)
233213 buzzinso… 1.5e+02(50) 1.6e+02(46) 1.5e+02(50) 1.5e+02(38) 1.5e+02(46) 1.6e+02(45) 1.5e+02(51) 1.6e+02(51)
233214 santande… 6.8e+06(4.4e+05) 6.9e+06(4.4e+05) 7e+06(4.4e+05) 7e+06(4.9e+05) 6.9e+06(4.5e+05) 6.9e+06(4.8e+05) 6.9e+06(4.7e+05) 7e+06(5.1e+05)
233215 mercedes… 8.6(1) 8.3(1.1) 8.3(1.1) 8.3(1.1) 8.3(1.1) 8.3(1.1) 8.3(1.1) 8.3(1.1)
317614 yolanda 8.5(0.041) 8.7(0.063) 8.6(0.051) 9.4(0.066) 8.8(0.041) 8.6(0.038) 8.6(0.064) 9.5(0.055)
359929 airlines… 29(0.23) 29(0.22) 29(0.26) 29(0.24) 29(0.24) 29(0.23) 29(0.2) 29(0.24)
359930 quake 0.19(0.0089) 0.19(0.0088) 0.19(0.0093) 0.19(0.0092) 0.19(0.01) 0.19(0.0098) 0.19(0.009) 0.19(0.0092)
359931 sensory 0.67(0.061) 0.7(0.063) 0.69(0.052) 0.68(0.057) 0.69(0.063) 0.68(0.062) 0.68(0.047) 0.69(0.067)
359932 socmob 12(7.9) 12(4.9) 15(8) 15(7.5) 13(9.6) 19(9.1) 28(55) 17(8.7)
359933 space_ga 0.094(0.013) 0.1(0.02) 0.1(0.016) 0.096(0.019) 0.098(0.015) 0.1(0.017) 0.098(0.018) 0.1(0.019)
359934 tecator 0.83(0.18) 0.67(0.16) 0.85(0.16) 0.81(0.36) 0.65(0.13) 0.78(0.25) 0.87(0.31) 0.62(0.23)
359935 wine_qua… 0.57(0.022) 0.6(0.019) 0.57(0.021) 0.57(0.02) 0.58(0.024) 0.58(0.023) 0.57(0.023) 0.58(0.024)
359936 elevators 0.0018(5.2e-05) 0.002(7.6e-05) 0.002(6.6e-05) 0.002(6.7e-05) 0.0021(5.6e-05) 0.002(5.7e-05) 0.0019(6e-05) 0.002(8.8e-05)
359937 black_fr… 3.5e+03(27) 3.4e+03(30) 3.4e+03(29) 3.5e+03(31) 3.4e+03(29) 3.4e+03(27) 3.4e+03(28) 3.5e+03(32)
359938 brazilia… 1.2e+04(2e+04) 1.5e+03(4.7e+03) 1.4e+03(2.7e+03) 4.6(4.9) 2.7e+02(2e+02) 3.9(5) 1.9e+16(6.1e+16) 4(5)
359939 topo_2_1 0.028(0.0049) 0.028(0.0049) 0.028(0.0048) 0.028(0.0048) 0.028(0.0049) 0.028(0.0049) 0.028(0.0048) 0.028(0.0049)
359940 yprop_4_1 0.028(0.0049) 0.028(0.0048) 0.028(0.0049) 0.028(0.0049) 0.028(0.0049) 0.028(0.0049) 0.028(0.0049) 0.028(0.0048)
359941 onlinene… 9.6e+05(3e+06) 1.1e+04(3.7e+03) 1.1e+04(3.6e+03) 3e+06(9.5e+06) 1.1e+04(3.7e+03) 1.1e+04(3.5e+03) 1.1e+04(3.7e+03) 1.1e+04(3.7e+03)
359942 colleges 0.13(0.0059) 0.14(0.0057) 0.14(0.0058) 0.14(0.0052) 0.14(0.0059) 0.14(0.0053) 0.14(0.006) 0.14(0.0055)
359943 nyc-taxi… 1.5(0.15) 1.8(0.15) 1.6(0.15) 1.8(0.18) 1.7(0.16) 1.7(0.18) 1.6(0.18) 1.8(0.17)
359944 abalone 2.1(0.12) 2.1(0.11) 2.1(0.12) 2.1(0.1) 2.1(0.1) 2.1(0.12) 2.1(0.12) 2.1(0.11)
359945 us_crime 0.13(0.0062) 0.13(0.0065) 0.13(0.0044) 0.13(0.0065) 0.13(0.007) 0.13(0.0057) 0.13(0.0062) 0.13(0.0067)
359946 pol 2.7(0.29) 3.8(0.54) 3.7(0.36) 3.9(0.22) 3.3(0.45) 3.9(0.31) 2.3(0.26) 4(0.31)
359948 sat11-ha… 8.8e+02(67) 1.1e+03(66) 1e+03(65) 1.1e+03(61) 9.5e+02(58) 1.2e+03(1e+02) 1e+03(95) 1.1e+03(44)
359949 house_sa… 1.1e+05(1.1e+04) 1.2e+05(1.5e+04) 1.1e+05(1.6e+04) 1.1e+05(1.7e+04) 1.1e+05(1.3e+04) 1.1e+05(1.6e+04) 1.1e+05(1.4e+04) 1.2e+05(1.8e+04)
359950 boston 2.8(0.83) 2.8(1) 2.9(1) 3.1(0.99) 3(1.1) 2.9(1) 3(0.87) 2.9(0.87)
359951 house_pr… 2.4e+04(6.7e+03) 2.6e+04(1e+04) 2.5e+04(7.7e+03) 2.5e+04(6.4e+03) 2.8e+04(1.1e+04) 2.5e+04(6.5e+03) 2.5e+04(7e+03) 2.9e+04(8.5e+03)
359952 house_16h 2.8e+04(2.3e+03) 3e+04(2.4e+03) 3e+04(2e+03) 3e+04(2e+03) 2.9e+04(1.9e+03) 2.9e+04(2.1e+03) 2.9e+04(2e+03) 3.1e+04(1.7e+03)
360932 qsar-tid… 0.72(0.072) 0.77(0.067) 0.72(0.071) 0.75(0.066) 0.73(0.071) 0.73(0.068) 0.71(0.052) 0.76(0.068)
360933 qsar-tid… 0.69(0.022) 0.73(0.026) 0.69(0.022) 0.71(0.031) 0.7(0.025) 0.69(0.021) 0.7(0.025) 0.72(0.026)
360945 mip-2016… 2.1e+04(1.6e+03) 2.2e+04(1.6e+03) 2.1e+04(1.7e+03) 2.1e+04(1.4e+03) 2.1e+04(2.1e+03) 2.1e+04(1.5e+03) 2.2e+04(2e+03) 2.2e+04(1.5e+03)
Table 8: Results for regression (in RMSE) on a one hour budget, denoted as mean(std).
framework AUTOGLUON AUTO-SKLEARN FLAML GAMA H2O AUTOML LIGHT AUTOML MLJAR TPOT
task id task name
167210 moneyball 21(0.86) 21(0.85) 22(0.87) 21(0.77) 22(0.98) 21(0.74) 21(0.86) 21(0.87)
233211 diamonds 5.1e+02(19) 5.2e+02(19) 5.2e+02(22) 5.2e+02(20) 5.1e+02(19) 5.2e+02(22) 5.1e+02(22) 5.3e+02(23)
233212 allstate… 1.9e+03(39) 1.9e+03(58) 1.9e+03(46) 8.6e+12(2.7e+13) 1.9e+03(47) 1.9e+03(49) 1.9e+03(60) 1.9e+03(50)
233213 buzzinso… 1.5e+02(49) 1.5e+02(50) 1.5e+02(49) 1.6e+02(47) 1.5e+02(49) 1.6e+02(45) 1.5e+02(49) 1.6e+02(46)
233214 santande… 6.8e+06(4.4e+05) 6.8e+06(4.8e+05) 6.8e+06(4.9e+05) 6.9e+06(4.3e+05) 6.9e+06(4.7e+05) 6.9e+06(5.3e+05) 6.8e+06(4.6e+05) 7e+06(4.4e+05)
233215 mercedes… 8.6(1) 8.3(1.1) 8.3(1.1) 8.3(1.1) 8.3(1.1) 8.3(1.1) 8.3(1.1) 8.3(1.1)
317614 yolanda 8.3(0.043) 8.7(0.051) 8.6(0.06) 9.2(0.1) 8.8(0.043) 8.6(0.042) 8.5(0.052) 9.3(0.11)
359929 airlines… 29(0.24) 29(0.28) 29(0.21) 29(0.25) 29(0.24) 29(0.24) 29(0.26) 29(0.28)
359930 quake 0.19(0.0093) 0.19(0.0089) 0.19(0.0091) 0.19(0.0092) 0.19(0.0094) 0.19(0.0099) 0.19(0.0093) 0.19(0.0096)
359931 sensory 0.67(0.061) 0.69(0.051) 0.69(0.054) 0.68(0.055) 0.7(0.062) 0.69(0.061) 0.67(0.043) 0.68(0.054)
359932 socmob 12(8.9) 11(3.5) 15(8) 14(7) 14(13) 19(9.3) 20(29) 16(8)
359933 space_ga 0.094(0.013) 0.1(0.025) 0.1(0.015) 0.096(0.019) 0.097(0.012) 0.1(0.017) 0.099(0.018) 0.099(0.018)
359934 tecator 0.83(0.18) 0.76(0.34) 0.85(0.17) 0.82(0.33) 0.82(0.3) 0.79(0.26) 0.85(0.19) 0.56(0.094)
359935 wine_qua… 0.57(0.021) 0.61(0.019) 0.57(0.022) 0.57(0.022) 0.58(0.023) 0.58(0.023) 0.57(0.024) 0.57(0.023)
359936 elevators 0.0018(5.2e-05) 0.0019(7.3e-05) 0.002(6.5e-05) 0.0019(6.5e-05) 0.002(0.00013) 0.002(5.7e-05) 0.0019(5.8e-05) 0.0019(6.4e-05)
359937 black_fr… 3.5e+03(28) 3.4e+03(30) 3.4e+03(27) 3.5e+03(29) 3.4e+03(30) 3.4e+03(27) 3.4e+03(29) 3.5e+03(30)
359938 brazilia… 4.6e+04(1.2e+05) 1.5e+03(4.7e+03) 1.5e+03(2.7e+03) 4.4(4.9) 2.3e+02(2e+02) 3.8(5) 6.2e+14(2e+15) 4.2(4.9)
359939 topo_2_1 0.028(0.0049) 0.028(0.0049) 0.028(0.0048) 0.028(0.0048) 0.028(0.0049) 0.028(0.0049) 0.028(0.0048) 0.028(0.0048)
359940 yprop_4_1 0.028(0.0049) 0.028(0.0048) 0.028(0.0048) 0.028(0.0049) 0.028(0.0049) 0.028(0.0049) 0.028(0.0049) 0.028(0.0048)
359941 onlinene… 4.3e+09(1.4e+10) 1.1e+04(3.7e+03) 1.1e+04(3.8e+03) 1.1e+04(3.7e+03) 1.2e+04(4.9e+03) 1.1e+04(3.6e+03) 1.3e+04(6e+03) 1.1e+04(3.7e+03)
359942 colleges 0.13(0.0059) 0.14(0.006) 0.14(0.0061) 0.14(0.0048) 0.14(0.0055) 0.14(0.0054) 0.14(0.006) 0.14(0.005)
359943 nyc-taxi… 1.6(0.14) 1.6(0.23) 1.6(0.044) 1.7(0.09) 1.6(0.15) 1.7(0.17) 1.6(0.14) 1.8(0.16)
359944 abalone 2.1(0.12) 2.1(0.11) 2.1(0.12) 2.1(0.1) 2.1(0.11) 2.1(0.12) 2.1(0.12) 2.1(0.11)
359945 us_crime 0.13(0.0061) 0.13(0.0066) 0.13(0.0048) 0.13(0.0059) 0.13(0.0064) 0.13(0.0059) 0.13(0.0052) 0.13(0.0058)
359946 pol 2.6(0.29) 3.3(0.35) 3.6(0.37) 3.7(0.3) 3.4(0.28) 3.9(0.33) 2.2(0.23) 3.7(0.38)
359948 sat11-ha… 8.8e+02(67) 1.1e+03(73) 9.9e+02(63) 1.1e+03(63) 9.3e+02(63) 1.2e+03(77) 1.1e+03(1.1e+02) 1e+03(69)
359949 house_sa… 1.1e+05(1.1e+04) 1.1e+05(1.4e+04) 1.1e+05(1.8e+04) 1.1e+05(1.6e+04) 1.1e+05(1.2e+04) 1.1e+05(1.6e+04) 1.1e+05(1.2e+04) 1.2e+05(1.2e+04)
359950 boston 2.8(0.83) 2.9(1) 2.8(0.62) 3(0.99) 2.8(0.81) 2.9(1.1) 3(0.93) 3.1(1)
359951 house_pr… 2.5e+04(7.6e+03) 2.5e+04(9.4e+03) 2.6e+04(7.7e+03) 2.4e+04(5.9e+03) 2.6e+04(8.5e+03) 2.5e+04(6.8e+03) 2.5e+04(7.1e+03) 2.7e+04(7.8e+03)
359952 house_16h 2.8e+04(2.3e+03) 2.9e+04(2.4e+03) 2.9e+04(2.1e+03) 3e+04(2.2e+03) 2.9e+04(1.9e+03) 2.9e+04(2.1e+03) 2.9e+04(2e+03) 3.1e+04(2.1e+03)
360932 qsar-tid… 0.72(0.075) 0.77(0.067) 0.72(0.073) 0.75(0.067) 0.73(0.072) 0.72(0.068) 0.7(0.032) 0.74(0.069)
360933 qsar-tid… 0.69(0.022) 0.73(0.033) 0.68(0.023) 0.71(0.025) 0.69(0.023) 0.69(0.021) 0.7(0.022) 0.71(0.027)
360945 mip-2016… 2.1e+04(1.6e+03) 2.2e+04(1.6e+03) 2.2e+04(1.5e+03) 2.1e+04(1.4e+03) 2.1e+04(1.9e+03) 2.1e+04(1.5e+03) 2.2e+04(1.4e+03) 2.2e+04(2.2e+03)
Table 9: Results for regression (in RMSE) on a four hour budget, denoted as mean(std).

B.1 BT-Trees

As described in Section 6.2, Bradley-Terry (BT) trees may be used to identify subsets of tasks for which the ‘preferred’ framework differs significantly. Figures 9-11 show BT trees for each task type and time budget, generated by splitting only on the number of instances or features. We observe that when splitting on only these meta-features, AUTOGLUON is the preferred framework in many cases, but almost every other framework is preferred over AUTOGLUON on at least one subset of the tasks. The online visualization tool may be used to generate additional BT trees, from different subsets of tasks or allowing different meta-features when calculating splits.

Figure 9: BT trees for binary classification data sets, (a) 1h and (b) 4h budgets.
Figure 10: BT trees for multiclass classification data sets, (a) 1h and (b) 4h budgets.
Figure 11: BT trees for regression data sets, (a) 1h and (b) 4h budgets.

B.2 Model Accuracy vs. Inference Time Trade-offs

Below are the same plots as presented in the results section, but using the median inference time instead of the mean.

Figure 12: Pareto frontiers of framework performance across tasks after scaling the performance values from the worst framework (0) to best observed (1), using median per-row prediction speeds.

Appendix C Software

This appendix contains additional details about the developed and used software.

C.1 Architecture Overview

Figure 13 shows the architecture and information flow when running the benchmarking tool in the setup used for the experiments in this paper. The dotted lines indicate calls (communication), whereas the solid lines indicate the transfer of files. For this paper, we distribute our workload to AWS EC2 instances, which run the experiments within a docker environment.

The experiments are initiated from a ‘local’ machine, shown enclosed in a blue rectangle at the bottom. The local process first reads information from a local configuration file and the command line, after which it uses boto3 to connect to AWS. It uploads the local files required for the experiments, such as the configuration files, to an S3 bucket accessible by the EC2 instances, deploys the jobs to EC2 instances, and tracks the instance status through CloudWatch.

Each EC2 instance (of type m5.2xlarge in our experiments) then installs the specified version of the benchmarking tool and retrieves the user configuration from the S3 bucket, the specified Docker image from Docker Hub, and the dataset (task) from OpenML. The experiment is then run in a Docker container (which in turn uses the benchmarking software in ‘local’ mode), which makes the software stack reproducible and requires much less time than installing each framework from scratch.

After the experiment has completed, the results are uploaded to an S3 bucket and the EC2 instance terminates. The local machine observes this shutdown and subsequently downloads the results from the S3 bucket.

With this setup, the requirements on the local machine are minimal. Provided the Docker images are built and published, the specified version of the benchmark tool is available online, and OpenML is used, the only data transfer between the local machine and the cloud is the upload of configuration files and the download of results.
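As an illustration of this information flow (and not the benchmark tool's actual implementation), the steps on the local side roughly correspond to the following boto3 sketch; the bucket name, AMI id, object keys, and bootstrap script are hypothetical placeholders.

# Hedged sketch of the local-machine side of Figure 13 using boto3.
import boto3

s3 = boto3.client("s3")
ec2 = boto3.resource("ec2")
BUCKET = "my-amlb-bucket"                      # hypothetical S3 bucket reachable by EC2

# 1. Upload the local configuration files needed by the experiment.
s3.upload_file("config.yaml", BUCKET, "input/config.yaml")

# 2. Launch an EC2 instance whose user-data script installs the benchmark tool,
#    pulls the Docker image, fetches the task from OpenML, and runs in 'local' mode.
instance, = ec2.create_instances(
    ImageId="ami-xxxxxxxx",                    # hypothetical AMI id
    InstanceType="m5.2xlarge",
    MinCount=1, MaxCount=1,
    UserData=open("bootstrap.sh").read(),      # hypothetical bootstrap script
)

# 3. Wait until the instance shuts itself down after uploading its results
#    (the real tool tracks instance status via CloudWatch rather than a blocking waiter).
instance.wait_until_terminated()

# 4. Download the results the instance left in the S3 bucket.
s3.download_file(BUCKET, "output/results.csv", "results.csv")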

Figure 13: Architecture overview of the AWS + Docker mode, as used for this paper.

C.2 Framework Versions

In principle, the latest version of each AutoML framework as of September 2021 was used. The specific version number of each framework is shown in Table 10. For frameworks that are installed directly from GitHub, only the first 10 characters of the commit hash are shown. For each framework, the latest available version as of April 2022 is also listed. Note that in GAMA version 21.0.0, released in January 2021, ensembling was only available through the ‘postprocessing’ hyperparameter; to comply with the benchmark design of only allowing a ‘preset’ hyperparameter, the 21.0.1 update introduced a ‘performance’ preset that uses ensembling.

framework benchmarked latest notes
AUTOGLUON 0.3.1 0.4.0 ‘best quality’ preset
AUTO-SKLEARN 0.14.0 0.14.6
AUTO-SKLEARN 2 0.14.0 0.14.6
FLAML 0.6.2 1.0.0
GAMA 21.0.1 21.0.1 ‘performance’ preset
H2O AUTOML 3.34.0.1 3.36.1.1
LIGHTAUTOML 0.2.16 0.3.3
MLJAR 0.11.0 0.11.2 ‘compete’ preset
ML-PLAN - 0.2.4 currently excluded, WEKA backend
MLR3AUTOML - f667900292 currently excluded
TPOT 0.11.7 0.11.7
Table 10: AutoML framework versions used in the experiments.

Appendix D AutoML Framework Errors

This appendix contains additional information on the AutoML framework errors encountered while running the experiments. As mentioned in Section 6.4, the number of errors increases with higher time budgets; this can be seen in Figure 14: in almost all cases, more errors were observed under the higher time budget.

Figure 14: Errors by benchmark suite and time budget.

D.1 Class Imbalance

The two classification tasks with a large number of failures despite their small size are ‘yeast’ and ‘wine-quality-white’, which feature a minority class with only 5 instances. Within the 10-fold cross-validation we perform in our experiments, either 4 or 5 of those instances are therefore available in the training splits. Failures only occur on folds where one of those instances is held out in the test split, leaving 4 in the training data. The exact error message differs per framework, but all indicate that evaluating pipelines fails. This is likely caused by, e.g., using 5-fold cross-validation out of the box, which can leave internal validation folds without any minority-class instance. Failures on these specific datasets (and folds) are only observed for GAMA, LightAutoML, and TPOT.
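A toy illustration of this failure mode (not any framework's internal code): assume a framework runs 5-fold stratified cross-validation on a training split that, as in the failing folds, contains only 4 instances of the minority class.

# Illustrative sketch with synthetic data; the class counts mimic the failing folds.
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 996 + [1] * 4)   # only 4 minority instances reach the training split

for fold, (_, val_idx) in enumerate(StratifiedKFold(n_splits=5).split(X, y)):
    print(f"internal fold {fold}: minority instances in validation = {int(y[val_idx].sum())}")

scikit-learn emits a warning that the least populated class has fewer members than n_splits, and at least one internal validation fold contains no minority instance at all, which can break metrics or model fitting that expect both classes to be present.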

D.2 MLJarSupervised

Two-thirds of all ‘implementation errors’ observed are failures of MLJarSupervised. All 190 failures are caused by variations of the following two unique errors:

  • "[’Ensemble_prediction_0_for_neg_1_for_pos’,
    ’2_DecisionTree_prediction_0_for_neg_1_for_pos’] not in index"

  • catboost/libs/data/model_dataset_compatibility.cpp:81:
    At position 6 should be feature with name 60_NeuralNetwork_prediction_0_for_1_1_for_2
    (found 60_NeuralNetwork_prediction).

These errors are specific to MLJarSupervised. While we can only guess at the cause, we assume it is related to MLJarSupervised’s extensive AutoML pipeline, which includes 10 different steps, among them three steps for feature generation and selection and three steps for ensembling and stacking. These steps are not all enabled by default (see https://supervised.mljar.com/features/modes/): feature engineering is only enabled in the ‘performance’ and ‘compete’ modes, and ensembling and stacking are only used in ‘compete’ mode, which is the mode we used.
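For reference, the ‘compete’ preset corresponds to instantiating the library roughly as follows. This is a hedged sketch based on the mljar-supervised documentation; parameter names and defaults may differ between versions.

# Assumed mljar-supervised usage; 'Compete' mode enables the feature engineering,
# ensembling, and stacking steps discussed above.
from supervised.automl import AutoML

automl = AutoML(
    mode="Compete",            # modes: 'Explain', 'Perform', 'Compete'; we used 'Compete'
    total_time_limit=3600,     # time budget in seconds
)
# automl.fit(X_train, y_train)
# predictions = automl.predict(X_test)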


References

  • S. P. Arango, H. S. Jomaa, M. Wistuba, and J. Grabocka (2021) HPO-B: A Large-Scale Reproducible Benchmark for Black-Box HPO based on OpenML. arXiv:2106.06257 [cs]. Cited by: §2.2.
  • A. Balaji and A. Allen (2018) Benchmarking automatic machine learning frameworks. CoRR abs/1808.06492. External Links: Link, 1808.06492 Cited by: §1.1, §2.1.
  • J. Bergstra, R. Bardenet, B. Kégl, and Y. Bengio (2011) Implementations of algorithms for hyper-parameter optimization. In NIPS Workshop on Bayesian optimization, pp. 29. Cited by: §3.1.6.
  • J. Bergstra, B. Komer, C. Eliasmith, and D. Warde-Farley (2014) Preliminary Evaluation of Hyperopt Algorithms on HPOLib. In ICML Workshop on Automatic Machine Learning, Cited by: §2.2.
  • J. Bergstra, D. Yamins, and D. Cox (2013) Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In International conference on machine learning, pp. 115–123. Cited by: §2.1.
  • B. Bischl, G. Casalicchio, M. Feurer, P. Gijsbers, F. Hutter, M. Lang, R. G. Mantovani, J. N. van Rijn, and J. Vanschoren (2021) OpenML benchmarking suites. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), Cited by: §2, §4.2, §5.1.1, §5.1, §5.
  • R. Caruana, A. Munson, and A. Niculescu-Mizil (2006) Getting the most out of ensemble selection. In Sixth International Conference on Data Mining (ICDM’06), pp. 828–833. Cited by: §3.1.2, §3.1.4, §6.3.
  • R. Caruana, A. Niculescu-Mizil, G. Crew, and A. Ksikes (2004) Ensemble selection from libraries of models. In Proceedings of the twenty-first international conference on Machine learning, pp. 18. Cited by: §3.1.1, §3.1.2, §3.1.4, §6.3.
  • W. Chang, J. Cheng, J. Allaire, Y. Xie, J. McPherson, et al. (2017) Shiny: web application framework for r. R package version 1 (5), pp. 2017. Cited by: §6.
  • T. Chen and C. Guestrin (2016) XGBoost: A scalable tree boosting system. CoRR abs/1603.02754. External Links: Link, 1603.02754 Cited by: §3.1.3.
  • K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6 (2), pp. 182–197. Cited by: §3.1.4.
  • J. Demšar (2006) Statistical Comparisons of Classifiers over Multiple Data Sets. The Journal of Machine Learning Research 7, pp. 1–30. Cited by: §6.1.
  • I. Drori, Y. Krishnamurthy, R. Rampin, R. Lourenço, J. One, K. Cho, C. Silva, and J. Freire (2018) AlphaD3M: machine learning pipeline synthesis. In 5th ICML Workshop on Automated Machine Learning (AutoML), Cited by: §1.1.
  • K. Eggensperger, M. Feurer, F. Hutter, J. Bergstra, J. Snoek, H. Hoos, and K. Leyton-Brown (2013) Towards an Empirical Foundation for Assessing Bayesian Optimization of Hyperparameters. In NIPS Workshop on Bayesian Optimization in Theory and Practice, Cited by: §2.2.
  • K. Eggensperger, P. Müller, N. Mallik, M. Feurer, R. Sass, A. Klein, N. Awad, M. Lindauer, and F. Hutter (2021) HPOBench: a collection of reproducible multi-fidelity benchmark problems for hpo. External Links: 2109.06716 Cited by: §2.2.
  • N. Erickson, J. Mueller, A. Shirkov, H. Zhang, P. Larroy, M. Li, and A. Smola (2020) AutoGluon-tabular: robust and accurate automl for structured data. External Links: 2003.06505 Cited by: §2.1, §3.1.1, Table 1, §5.3.1.
  • K. Erol, J. A. Hendler, and D. S. Nau (1994) UMCP: a sound and complete procedure for hierarchical task-network planning.. In Aips, Vol. 94, pp. 249–254. Cited by: §1.
  • H. J. Escalante, M. Montes, and L. E. Sucar (2009) Particle swarm model selection.. Journal of Machine Learning Research 10 (2). Cited by: §3.
  • M. J.A. Eugster, F. Leisch, and C. Strobl (2014) (Psycho-)analysis of benchmark experiments. Comput. Stat. Data Anal. 71 (C), pp. 986–1000. External Links: ISSN 0167-9473 Cited by: §6.2, §6.2, §6.2.
  • M. Ferreira, R. Ventorim, E. Almeida, S. Silveira, and W. Silveira (2021) Protein abundance prediction through machine learning methods. Journal of molecular biology 433 (22), pp. 167267. Cited by: §1.1.
  • M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter (2015a) Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, pp. 2962–2970. Cited by: §1.1, §3.1.1, §3.1.2, Table 1, §5.3.3.
  • M. Feurer, K. Eggensperger, S. Falkner, M. Lindauer, and F. Hutter (2018) Practical automated machine learning for the automl challenge 2018. In International Workshop on Automatic Machine Learning at ICML, pp. 1189–1232. Cited by: §3.1.2.
  • M. Feurer, K. Eggensperger, S. Falkner, M. Lindauer, and F. Hutter (2020) Auto-sklearn 2.0: hands-free automl via meta-learning. arXiv. External Links: Document, Link Cited by: §3.1.2, Table 1, §5.3.3.
  • M. Feurer, J. Springenberg, and F. Hutter (2015b) Initializing bayesian hyperparameter optimization via meta-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 29. Cited by: §3.1.2.
  • P. Gijsbers, E. LeDell, S. Poirier, J. Thomas, B. Bischl, and J. Vanschoren (2019) An open source automl benchmark. arXiv preprint arXiv:1907.00909 [cs.LG]. Note: Accepted at AutoML Workshop at ICML 2019 External Links: Link Cited by: §3, §5.3.3, footnote 1.
  • P. Gijsbers and J. Vanschoren (2021) GAMA: a general automated machine learning assistant. In Machine Learning and Knowledge Discovery in Databases. Applied Data Science and Demo Track, Y. Dong, G. Ifrim, D. Mladenić, C. Saunders, and S. Van Hoecke (Eds.), Cham, pp. 560–564. External Links: ISBN 978-3-030-67670-4 Cited by: §3.1.4, Table 1.
  • Y. Gil, K. Yao, V. Ratnakar, D. Garijo, G. Ver Steeg, P. Szekely, R. Brekelmans, M. Kejriwal, F. Luo, and I. Huang (2018) P4ML: a phased performance-based pipeline planner for automated machine learning. In 5th ICML Workshop on Automated Machine Learning (AutoML), Cited by: §1.1.
  • I. Guyon, L. Sun-Hosoya, M. Boullé, H. J. Escalante, S. Escalera, Z. Liu, D. Jajetic, B. Ray, M. Saeed, M. Sebag, et al. (2019) Analysis of the automl challenge series. Automated Machine Learning, pp. 177. Cited by: §2.1, §3.1.2, §4.1, §5.1.1.
  • H2O.ai (2013) H2O: scalable machine learning platform. Note: First version of H2O was released in 2013 External Links: Link Cited by: §3.1.5.
  • M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten (2009) The weka data mining software: an update. SIGKDD Explor. Newsl. 11 (1), pp. 10–18. External Links: ISSN 1931-0145, Link, Document Cited by: §3.
  • N. Hansen, A. Auger, R. Ros, O. Mersmann, T. Tušar, and D. Brockhoff (2021) COCO: a platform for comparing continuous optimizers in a black-box setting. Optimization Methods and Software 36 (1), pp. 114–144. Cited by: §2.2.
  • F. Hutter, H. H. Hoos, and K. Leyton-Brown (2011) Sequential model-based optimization for general algorithm configuration. In International conference on learning and intelligent optimization, pp. 507–523. Cited by: §1.
  • F. Hutter, L. Kotthoff, and J. Vanschoren (Eds.) (2019) Automated machine learning - methods, systems, challenges. Springer. Cited by: §1.
  • K. Jamieson and A. Talwalkar (2016) Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, A. Gretton and C. C. Robert (Eds.), Proceedings of Machine Learning Research, Vol. 51, Cadiz, Spain, pp. 240–248. External Links: Link Cited by: §3.1.2.
  • H. Jin, Q. Song, and X. Hu (2019) Auto-keras: an efficient neural architecture search system. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1946–1956. Cited by: §3.
  • G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T. Liu (2017) LightGBM: a highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems 30, pp. 3146–3154. Cited by: §3.1.3.
  • A. Klein, Z. Dai, F. Hutter, N. Lawrence, and J. Gonzalez (2019) Meta-surrogate benchmarking for hyperparameter optimization. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’ Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. . External Links: Link Cited by: §2.2.
  • H. Kotthaus, I. Korb, M. Lang, B. Bischl, J. Rahnenführer, and P. Marwedel (2015) Runtime and memory consumption analyses for machine learning r programs. Journal of Statistical Computation and Simulation 85 (1), pp. 14–29. External Links: Document Cited by: §2.
  • J. R. Koza and J. R. Koza (1992) Genetic programming: on the programming of computers by means of natural selection. Vol. 1, MIT press. Cited by: §1.
  • T. T. Le, W. Fu, and J. H. Moore (2018) Scaling tree-based automated machine learning to biomedical big data with a dataset selector. BioRxiv, pp. 502484. Cited by: §3.1.8.
  • E. LeDell and S. Poirier (2020) H2O AutoML: scalable automatic machine learning. In 7th ICML workshop on automated machine learning, External Links: Link Cited by: §2.1, §3.1.5, Table 1.
  • F. Mohr, M. Wever, and E. Hüllermeier (2018) ML-plan: automated machine learning via hierarchical planning. Machine Learning 107 (8), pp. 1495–1515. External Links: ISSN 1573-0565, Document, Link Cited by: §1.1, §3.
  • F. Mohr and M. Wever (2021) Replacing the ex-def baseline in automl by naive automl. In 8th ICML Workshop on Automated Machine Learning (AutoML), Cited by: §3.2.
  • T. Ohta and H. V. Yamazaki (2022) Kurobako. GitHub. Note: https://github.com/optuna/kurobako Cited by: §2.2.
  • R. S. Olson, W. La Cava, P. Orzechowski, R. J. Urbanowicz, and J. H. Moore (2017) PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData mining 10 (1), pp. 1–13. Cited by: §2.
  • R. S. Olson, N. Bartley, R. J. Urbanowicz, and J. H. Moore (2016) Evaluation of a tree-based pipeline optimization tool for automating data science. CoRR abs/1603.06212. External Links: Link, 1603.06212 Cited by: §3.1.8.
  • R. S. Olson and J. H. Moore (2016) TPOT: a tree-based pipeline optimization tool for automating machine learning. In Proceedings of the Workshop on Automatic Machine Learning, F. Hutter, L. Kotthoff, and J. Vanschoren (Eds.), Proceedings of Machine Learning Research, Vol. 64, New York, New York, USA, pp. 66–74. External Links: Link Cited by: §3.1.8, Table 1.
  • P. Parry (2018) Auto_ml. GitHub. Note: https://github.com/ClimbsRocks/auto_ml Cited by: §2.1.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. (2011) Scikit-learn: machine learning in python. Journal of machine learning research 12 (Oct), pp. 2825–2830. Cited by: §3.1.2.
  • A. Płońska and P. Płoński (2021) MLJAR: state-of-the-art automated machine learning framework for tabular data. version 0.10.3. MLJAR, Łapy, Poland. External Links: Link Cited by: §3.1.7, Table 1.
  • P. Probst, A. Boulesteix, and B. Bischl (2019) Tunability: Importance of Hyperparameters of Machine Learning Algorithms. Journal of Machine Learning Research 20 (53), pp. 1–32. External Links: ISSN 1533-7928, Link Cited by: §1.
  • L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin (2018) CatBoost: unbiased boosting with categorical features. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, Red Hook, NY, USA, pp. 6639–6649. Cited by: §3.1.3.
  • H. Rakotoarison, M. Schoenauer, and M. Sebag (2019) Automated machine learning with monte-carlo tree search. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, pp. 3296–3303. Cited by: §1.1.
  • M. Reif, F. Shafait, and A. Dengel (2012) Meta-learning for evolutionary parameter optimization of classifiers. Machine learning 87 (3), pp. 357–380. Cited by: §5.3.3.
  • J. D. Romano, T. T. Le, W. Fu, and J. H. Moore (2021) TPOT-nn: augmenting tree-based automated machine learning with neural network estimators. Genetic Programming and Evolvable Machines 22 (2), pp. 207–227. Cited by: §3.1.8.
  • R. Schmucker, M. Donini, M. B. Zafar, D. Salinas, and C. Archambeau (2021) Multi-objective asynchronous successive halving. External Links: 2106.12639 Cited by: §7.2.
  • K. Šehić, A. Gramfort, J. Salmon, and L. Nardi (2021) LassoBench: a high-dimensional hyperparameter optimization benchmark suite for lasso. arXiv preprint arXiv:2111.02790. Cited by: §2.2.
  • X. Shi, J. Mueller, N. Erickson, M. Li, and A. J. Smola (2021) Benchmarking multimodal automl for tabular data with text fields. arXiv preprint arXiv:2111.02705. Cited by: 3rd item.
  • J. Siems, L. Zimmer, A. Zela, J. Lukasik, M. Keuper, and F. Hutter (2020) NAS-bench-301 and the case for surrogate benchmarks for neural architecture search. CoRR abs/2008.09777. External Links: Link, 2008.09777 Cited by: §2.2.
  • J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical bayesian optimization of machine learning algorithms. Advances in neural information processing systems 25. Cited by: §1.
  • A. Sohn, R. S. Olson, and J. H. Moore (2017) Toward the automated analysis of complex diseases in genome-wide association studies using genetic programming. In Proceedings of the genetic and evolutionary computation conference, pp. 489–496. Cited by: §3.1.8.
  • C. Strobl, F. Wickelmaier, and A. Zeileis (2011) Accounting for individual differences in bradley-terry models by means of recursive partitioning. Journal of Educational and Behavioral Statistics 36 (2), pp. 135–153. External Links: Document, Link, https://doi.org/10.3102/1076998609359791 Cited by: §6.2, §7.
  • J. Thomas, S. Coors, and B. Bischl (2018) Automatic gradient boosting. In International Workshop on Automatic Machine Learning at ICML, Cited by: §3.
  • C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown (2013) Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In Proc. of KDD-2013, pp. 847–855. Cited by: §1.1, §2.1, §3, §5.1.1.
  • A. Truong, A. Walters, J. Goodsitt, K. E. Hines, C. B. Bruss, and R. Farivar (2019) Towards Automated Machine Learning: Evaluation and Comparison of AutoML Approaches and Tools. In 31st IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2019, Portland, OR, USA, November 4-6, 2019, pp. 1471–1479. External Links: Document Cited by: §2.1, §2.1, §5.3.1.
  • R. Turner (2022) Uber. bayesopt benchmark. GitHub. Note: https://github.com/uber/bayesmark Cited by: §2.2.
  • A. Vakhrushev, A. Ryzhkov, M. Savchenko, D. Simakov, R. Damdinov, and A. Tuzhilin (2021) LightAutoML: automl solution for a large financial services ecosystem. arXiv preprint arXiv:2109.01528. Cited by: §3.1.6, Table 1.
  • K. Van der Blom, A. Serban, H. Hoos, and J. Visser (2021) AutoML adoption in ml software. In 8th ICML Workshop on Automated Machine Learning (AutoML), Cited by: §1.
  • T. Van Gestel, J. A. Suykens, B. Baesens, S. Viaene, J. Vanthienen, G. Dedene, B. De Moor, and J. Vandewalle (2004) Benchmarking least squares support vector machine classifiers. Machine Learning 54 (1), pp. 5–32. Cited by: §2.
  • J. N. Van Rijn and F. Hutter (2018) Hyperparameter importance across datasets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2367–2376. Cited by: §1, §3.2.
  • J. Vanschoren, J. van Rijn, B. Bischl, and L. Torgo (2014) OpenML: networked science in machine learning. SIGKDD Explor. Newsl. 15 (2), pp. 49–60. Cited by: §2.1, §2.2.
  • C. Wang, Q. Wu, M. Weimer, and E. Zhu (2021) FLAML: a fast and lightweight automl library. In MLSys, Cited by: §3.1.3, Table 1.
  • H. J. Weerts, A. C. Mueller, and J. Vanschoren (2020) Importance of tuning hyperparameters of machine learning algorithms. arXiv preprint arXiv:2007.07588. Cited by: §1.
  • M. Wever, A. Tornede, F. Mohr, and E. Hullermeier (2021) AutoML for Multi-Label Classification: Overview and Empirical Evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence (), pp. 1–19. External Links: Document Cited by: §2.1.
  • Q. Wu, C. Wang, and S. Huang (2021) Frugal optimization for cost-related hyperparameters. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 10347–10354. Cited by: §3.1.3.
  • Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande (2018) MoleculeNet: a benchmark for molecular machine learning. Chemical science 9 (2), pp. 513–530. Cited by: §2.
  • C. Yang, Y. Akimoto, D.W. Kim, and M. Udell (2018) OBOE: collaborative filtering for automl initialization. CoRR abs/1808.03233. External Links: Link, 1808.03233 Cited by: §3, §5.3.3.
  • C. Ying, A. Klein, E. Christiansen, E. Real, K. Murphy, and F. Hutter (2019) NAS-bench-101: towards reproducible neural architecture search. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 7105–7114. External Links: Link Cited by: §2.2.
  • A. Zeileis and K. Hornik (2007) Generalized m-fluctuation tests for parameter instability. Statistica Neerlandica 61 (4), pp. 488–508. External Links: Document, Link, https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-9574.2007.00371.x Cited by: §6.2.
  • A. Zela, J. Siems, and F. Hutter (2020) NAS-bench-1shot1: benchmarking and dissecting one-shot neural architecture search. CoRR abs/2001.10422. External Links: Link, 2001.10422 Cited by: §2.2.
  • L. Zimmer, M. Lindauer, and F. Hutter (2021) Auto-PyTorch Tabular: multi-fidelity metalearning for efficient and robust AutoDL. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 3079–3090. Note: also available under https://arxiv.org/abs/2006.13799 Cited by: §3.
  • M. Zöller and M. F. Huber (2021) Benchmark and survey of automated machine learning frameworks. J. Artif. Intell. Res. 70, pp. 409–472. External Links: Document Cited by: §2.1, §2.1, §5.3.1.