LightAutoML: AutoML Solution for a Large Financial Services Ecosystem

09/03/2021
by   Anton Vakhrushev, et al.
Sberbank

We present an AutoML system called LightAutoML developed for a large European financial services company and its ecosystem satisfying the set of idiosyncratic requirements that this ecosystem has for AutoML solutions. Our framework was piloted and deployed in numerous applications and performed at the level of the experienced data scientists while building high-quality ML models significantly faster than these data scientists. We also compare the performance of our system with various general-purpose open source AutoML solutions and show that it performs better for most of the ecosystem and OpenML problems. We also present the lessons that we learned while developing the AutoML system and moving it into production.


1. Introduction

AutoML has attracted much attention over the last few years, both in industry and academia (Guyon et al., 2019). In particular, several companies, such as H2O.ai (https://www.h2o.ai/), DataRobot (https://www.datarobot.com/), DarwinAI (https://darwinai.ca/) and OneClick.ai (https://www.oneclick.ai/), and existing AutoML libraries, such as AutoWeka (Thornton et al., 2013; Kotthoff et al., 2017), MLBox, AutoKeras (Jin et al., 2019), Google's Cloud AutoML (https://cloud.google.com/automl/), Amazon's AutoGluon (Erickson et al., 2020), IBM Watson AutoAI (https://www.ibm.com/cloud/watson-studio/autoai) and Microsoft Azure AutoML (https://docs.microsoft.com/en-us/azure/machine-learning/concept-automated-ml), have provided industrial solutions that automatically generate ML-based models. Most of these approaches produce general-purpose AutoML solutions that automatically develop ML-based models across a broad class of applications in financial services, healthcare, advertising, manufacturing and other industries (Guyon et al., 2019).

The key assumption of this horizontal approach is that the process of automated model development remains the same across all these applications. In this paper, however, we focus on developing a vertical AutoML solution suitable for the needs of the ecosystem (Pidun et al., 2019) of a large European financial services company comprising a wide range of banking and other types of financial as well as non-financial services, including telecommunications, transportation and e-commerce for the B2B and B2C sectors of the economy. We argue in the paper that such an ecosystem has an idiosyncratic set of requirements for building ML models and would be better served by a domain-specific AutoML solution, rather than by a generic horizontal AutoML system. In particular, our ecosystem has the following set of requirements:

  • The AutoML system should be able to work with different types of data collected from hundreds of different information systems, which often change more rapidly than they can be fully documented with metadata and painstakingly preprocessed by data scientists for ML tasks using ETL tools.

  • Many of our models are built on large datasets with thousands or tens of thousands of features and millions of records. This makes it important to develop fast AutoML methods that can handle such datasets efficiently.

  • The number of production-level models across our complex ecosystem is very large, measured in thousands, and continues to increase rapidly, forming a Long Tail in terms of their popularity and economic efficiency. This makes it necessary for the AutoML system to accurately build and maintain all these models efficiently and cost-effectively. Furthermore, besides building production models, it is necessary to build a very large number of models to validate numerous hypotheses tested across the entire ecosystem and do it effectively.

  • Many of our business processes are non-stationary and change rapidly over time, which complicates validating and keeping up to date the ML models embedded in these evolving processes. This means, among other things, the need to satisfy specific model validation requirements, including out-of-time validation and validation of client behavioral models (models that take a sequence of single-object states as input).

In this paper, we introduce a vertical type of AutoML, called LightAutoML, which focuses on the aforementioned needs of our complex ecosystem and has the following characteristics. First, it provides nearly optimal and fast search of hyperparameters, but does not optimize them directly; nevertheless, it makes sure that it produces satisficing (Simon, 1956) results. Furthermore, we dynamically keep the balance between hyperparameter optimization and speed, making sure that our solutions are optimal on small problems and fast enough on large ones. Second, we purposely limit the range of ML models to only two types, i.e., gradient boosted decision trees (GBMs) and linear models, instead of having large ensembles of multiple algorithms, in order to speed up LightAutoML execution time without sacrificing its performance for our types of problems and data. Third, we present a unique method of choosing preprocessing schemes for different features used in our models based on certain types of meta-statistics and selection rules.

We tested the proposed LightAutoML system on a wide range of open and proprietary data sources across a wide range of applications and demonstrated its superior performance in our experiments. Furthermore, we deployed LightAutoML in numerous applications across five different platforms in our ecosystem, which enabled the company to save millions of dollars; we present our experiences with this deployment and its business outcomes. In particular, the initial economic effects of LightAutoML in these applications range from 3% to 5% of the total economic effect of the ML solutions deployed in the company. Moreover, LightAutoML provided certain novel capabilities that are impossible for humans to perform, such as generating massive numbers of ML models in record time in a non-stop (24-7-365) working mode.

In this paper, we make the following contributions. First, we present the LightAutoML system developed for the ecosystem of a large financial services company comprising a wide range of banking and other types of financial as well as non-financial services. Second, we compare LightAutoML with the leading general-purpose AutoML solutions and demonstrate that LightAutoML outperforms them across several ecosystem applications and on the open source AutoML benchmark OpenML (Gijsbers et al., 2019). Third, we compare the performance of LightAutoML models with models manually tuned by data scientists and demonstrate that the LightAutoML models usually outperform the data scientists. Finally, we describe our experiences with the deployment of LightAutoML in our ecosystem.

2. Related work

The early work on AutoML goes back to the mid-1990s, when the first papers on hyper-parameter optimization were published (King et al., 1995; Kohavi and John, 1995). Subsequently, the concepts of AutoML were expanded, and interest in AutoML has grown significantly since the publication of the Auto-WEKA paper (Thornton et al., 2013) in 2013 and the organization of the AutoML workshop at ICML in 2014.

One of the main areas of AutoML is the problem of hyper-parameter search, where the best performing hyper-parameters for a particular ML model are determined in a large hyper-parameter space using iterative optimization methods (Bergstra et al., 2011). Another approach is to estimate the probability that a particular hyper-parameter is the optimal one for a given model using Bayesian methods that typically use historical data from other datasets and the models previously estimated on those datasets (Swersky et al., 2013; Springenberg et al., 2016; Perrone et al., 2017). Other methods search not only in the space of hyper-parameters but also try to select the best models from the space of several possible modeling alternatives (Yang et al., 2019; Lee et al., 2018; Olson and Moore, 2019; Santos et al., 2019). For example, TPOT (Olson and Moore, 2019) generates a set of best performing models from Sklearn and XGBoost and automatically chooses the best subset of models. Moreover, other papers focus on the problem of automated deep learning model selection and optimization (Lee et al., 2018; Zimmer et al., 2020). Finally, several papers propose various methods of automatic feature generation (Olson and Moore, 2019).

In addition to proposing the specific approaches to AutoML described above, there has been a discussion in the AutoML community on what AutoML is and how to properly define it, with different authors expressing their points of view on the subject. In particular, while some approaches focus only on the modeling stage of the CRISP-DM model lifecycle, other approaches take a broader view of the process and cover other lifecycle stages. For example, according to Shubha Nabar from Salesforce, "most auto-ML solutions today are either focused very narrowly on a small piece of the entire machine learning workflow, or are built for unstructured, homogenous data for images, voice and language" (Nabar, 2018). She then argues that the real goal of an AutoML system is an end-to-end approach across the whole CRISP-DM cycle that "transforms customer data into meaningful actionable predictions" (Nabar, 2018). A similarly broad view of AutoML is presented in (Guyon and Elisseeff, 2003), where it was argued that AutoML focuses on "removing the need for human interaction in applying machine learning (ML) to practical problems". A similar argument for a broad view of AutoML as an end-to-end process taking the input data and automatically producing an optimized predictive model was presented in (MSV, 2018).

Furthermore, some successful examples of industry-specific AutoML solutions for the medical, financial, and advertisement domains are reviewed in (Guyon et al., 2019). One particular application of AutoML in the financial sector is presented in (Agrapetidou et al., 2020), where a simple general-purpose AutoML system utilizing random forests, support vector machines, k-nearest neighbors with different kernels, and other ML methods was created for the task of detecting bank failures. This system is an experimental proof-of-concept focusing on the narrow task of bank failures, rather than an industrial-level AutoML solution designed for a broad class of financial applications.

In this paper, we focus on a broader approach to AutoML, which includes the stages of data processing, model selection, and hyper-parameter tuning. This is in line with other popular approaches to AutoML incorporated into systems such as AutoGluon (Erickson et al., 2020), H2O (LeDell and Poirier, 2020), AutoWeka (Thornton et al., 2013), TPOT (Olson and Moore, 2019), Auto-keras (Jin et al., 2019), AutoXGBoost (Thomas et al., 2018), Auto-sklearn (Feurer et al., 2019), and Amazon SageMaker (Das et al., 2020).

3. Overview of LightAutoML

In this section, we describe the high-level structure and implementation details of LightAutoML, an open source modular AutoML framework available in our GitHub repository (https://github.com/sberbank-ai-lab/LightAutoML). LightAutoML consists of modules that we call Presets, each focused on end-to-end model development for a typical class of ML tasks. Currently, LightAutoML supports the following four Preset modules. First, the TabularAutoML Preset focuses on classical ML problems defined on tabular datasets. Second, the WhiteBox Preset solves binary classification tasks on tabular data using simple interpretable algorithms, such as Logistic Regression over discretized features and Weight of Evidence (WoE) (Zeng, 2014) encoding. This is a commonly used approach to model the probability of client default in banking applications because of the interpretability constraints posed by the regulators and the high cost of approving a loan for a bad customer. Third, the NLP Preset is the same as Tabular, but it is also able to combine tabular pipelines with NLP tools, such as specific feature extractors or pre-trained deep learning models. Finally, the CV Preset implements some basic tools for working with image data. In addition, it is also possible to build custom modules and Presets using the LightAutoML API; some examples are available in our GitHub repository and on Kaggle Kernels (https://www.kaggle.com/simakov/lama-custom-automl-pipeline-example). Although LightAutoML supports all four Presets, only TabularAutoML is currently being used in our production-level system, so we focus on it in the rest of this paper.

A typical LightAutoML pipeline scheme is presented in Figure 1, each pipeline containing:

  • Reader: an object that receives raw data and the task as input, calculates some useful metadata, performs initial data cleaning, and decides which data manipulations should be done before fitting different model types.

  • LightAutoML inner datasets that contain metadata and CV iterators implementing the validation scheme for the datasets.

  • Multiple ML Pipelines that are stacked (Ting and Witten, 1997) and/or blended (averaged) via Blender to get a single prediction.

Figure 1. Main components of LightAutoML Pipeline


An ML pipeline in LightAutoML is one or more ML models that share a single data preprocessing and validation scheme. The preprocessing step may have up to two feature selection steps and a feature engineering step, or may even be empty if no preprocessing is needed. ML pipelines can be computed independently on the same datasets and then blended together using averaging (or weighted averaging). Alternatively, a stacking ensemble scheme can be used to build multi-level ensemble architectures.

In the next section we focus only on TabularAutoML Preset since it is the main Preset considered in this paper. We also compare it with other popular open source AutoML frameworks.

3.1. LightAutoML’s Tabular Preset

TabularAutoML is the default LightAutoML pipeline that solves three types of tasks on tabular data: binary classification, multiclass classification, and regression, for various types of loss functions and performance metrics. The input data for TabularAutoML is a table containing four types of columns: numeric features, categorical features, timestamps, and a single target column with continuous values or class labels.

The key features of our LightAutoML pipeline are:

  • Strong baseline: works well on most datasets

  • Fast: no metamodels or pipeline optimization

  • Advanced data preprocessing compared to other popular open source solutions

One of our main goals in designing LightAutoML was to make a tool for fast hypothesis testing. Therefore, we avoid using brute-force methods for optimal pipeline search and focus only on the models and efficient techniques that work across a wide range of datasets. In particular, we train only two classes of models represented by three types of algorithms in the following order: a linear model with an L2 penalty, the lightgbm implementation of the GBM method (Ke et al., 2017), and the catboost implementation of GBM (Dorogush et al., 2018).

The selected order matters here because it helps to manage time if a user sets a time limit. The algorithms are trained in increasing order of the time they usually take to train. Therefore, we can guarantee that if the time limit is set to a reasonable value, at least the fastest model will be computed even on large datasets. On the other hand, if the prior time estimate turns out to be too conservative, fast completion of the earlier models frees more time for training and hyperparameter tuning of the slower ones.
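The sketch below illustrates this kind of budget management; it is not the actual LightAutoML scheduler, and the `algorithms` list and the `fit(time_limit=...)` interface are hypothetical stand-ins.

```python
import time

def run_with_budget(algorithms, total_budget):
    """Train algorithms ordered from fastest to slowest within a shared budget."""
    trained, deadline = [], time.time() + total_budget
    for i, algo in enumerate(algorithms):     # e.g. [linear, lightgbm, catboost]
        remaining = deadline - time.time()
        if remaining <= 0:                    # budget exhausted: stop early
            break
        # Split the remaining time evenly among the models still queued; if the
        # earlier (fast) models finish ahead of schedule, slower ones get more time.
        trained.append(algo.fit(time_limit=remaining / (len(algorithms) - i)))
    return trained
```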

More traditional ML algorithms were selected for the LightAutoML system because, despite the trend towards neural networks in different domains, GBM-based methods show strong performance on tabular data and currently outperform other approaches in many benchmarks and competitions. Furthermore, various GBM frameworks are commonly used in industry to develop production models (Ke et al., 2017; Dorogush et al., 2018; Chen and Guestrin, 2016). In addition, linear models are fast, easy to tune, and can boost the performance of tree-based models in ensembles by adding variety to the predictions (Breiman, 1996). In comparison, other popular open source AutoML frameworks usually use significantly more classes of models and therefore take more time to train. As we show in Section 4.1, the approach described in this section, with a few additional features, is able to outperform existing solutions both on the internal datasets used in our company and on the OpenML benchmark (Gijsbers et al., 2019).

3.2. Data preprocessing and auto-typing

Since we initially developed LightAutoML as an application for our ecosystem, we devoted a lot of attention to the data preprocessing part. As already mentioned in Section 1, we should be ready to work with datasets in different formats and scales, containing artifacts, NaNs, or unspecified user preprocessing.

To handle different types of features in different ways, we need to know each feature type. In the case of a single task with a small dataset, a user can specify it manually, but this becomes a real problem in the case of hundreds of tasks with datasets that contain thousands of features. This is very typical for bank applications, and it takes hours of data scientists' work to perform this data analysis and labeling. Thus, the AutoML framework needs to solve the problem of automatic data type inference (auto-typing). In the case of the TabularAutoML Preset, we need to map features into three classes: numeric, category, and datetime. One simple and obvious solution is to map column array data types to actual feature types as follows: float/int to numeric; timestamp, or string that can be parsed as a timestamp, to datetime; everything else to category.

However, this mapping is not the best way to handle the problem, because the occurrence of numeric data types among the values of category columns is a very common case. An ML-based solution to this problem is described in (Shah and Kumar, 2019), where different models over meta-statistics are used to predict human feature type labeling. Deep learning has also been used to solve the similar, but slightly different, problem of semantic data type detection in (Hulsebos et al., 2019).

We solve this problem in a slightly different way: we say that a column is a category column if category encoding techniques, such as a target encoder (OOFEnc) (Micci-Barreca, 2001; Dorogush et al., 2018) or a frequency encoder (FreqEnc), which encodes a category by the number of its occurrences in the train sample, perform better than numeric treatments, such as using the raw values or values discretized by quantiles (QDiscr). Building many models to verify the performance of all encoding combinations becomes impractical, so we need a proxy metric that is simple to calculate and agnostic to the type of LightAutoML task, including the given loss and metric functions. We choose the Normalized Gini Index (Chen et al., 1982) between the target variable and the encoded feature as the measure of encoding quality, because it estimates the quality of sorting and can be computed for both classification and regression tasks. The specifics of the auto-typing algorithm are presented in Appendix B (as Algorithm 1). The final decision is made by 10 expert rules over the estimated encoding qualities and some other meta-statistics, such as the number of unique values. The exact list of rules can be viewed in the LightAutoML repository.

Note that we did not use ML models for auto-typing because, as mentioned before, our goal was not to predict human labeling but to guess what will actually be better for the final model performance, and obtaining that kind of ground truth labeling is impossible at the moment. Sometimes the LightAutoML auto-typing prediction may differ a lot from a human's point of view, but it may lead to a significant boost in performance; see the datasets guillermo, amazon_employee and robert in Table 7, which contain many categories from the auto-typing point of view, and Section 4.2.

After we infer the type of a candidate feature, we can additionally guess the optimal way to preprocess it, for example, select the best category encoding or decide whether numbers should be discretized. A similar algorithm can be used for that purpose with a small adaptation, using different rules and encoding methods.

3.3. Validation schemes

As was mentioned earlier, data in industry may change rapidly over time in some ecosystem processes, which makes the independent identically distributed (IID) assumption irrelevant in model development. There are cases when time-series-based, grouped, or even custom validation splits are required. This becomes important because validation in AutoML is used not only for performance estimation but also for hyperparameter search and out-of-fold prediction generation. Out-of-fold predictions are used for blending and stacking models at upper LightAutoML levels and are also returned as predictions on the train set for user analysis.

To the best of our knowledge, other popular AutoML frameworks use only the classical KFold or random holdout approaches, while advanced validation schemes help to handle non-IID cases and make models more robust and stable over time. This problem is out of the scope of the OpenML benchmark tasks, but it becomes relevant in production applications. The validation schemes currently available in TabularAutoML are:

  • KFold cross-validation, used by default (including stratified KFold for classification tasks or GroupKFold for behavioral models, if a group parameter for fold splitting is specified).

  • Holdout validation, if a holdout set is specified.

  • Custom validation schemes (including time-series split (Cerqueira et al., 2020) cross-validation).

All models trained on different folds during the cross-validation loop are saved for the inference phase. Inference on new data is made by averaging the models from all training folds.
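The sketch below illustrates the three kinds of splits with scikit-learn iterators; LightAutoML wraps its own CV iterator objects around its inner datasets, so this is only an analogy.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold, TimeSeriesSplit

X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)
groups = np.random.randint(0, 50, size=1000)   # e.g. client IDs for behavioral models

cv_default = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # classification default
cv_grouped = GroupKFold(n_splits=5)            # all records of a client stay in one fold
cv_time = TimeSeriesSplit(n_splits=5)          # training folds always precede validation in time

for train_idx, valid_idx in cv_grouped.split(X, y, groups=groups):
    pass  # fit one model per fold; out-of-fold predictions feed stacking and blending
```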

3.4. Feature selection

Feature selection is a very important part of industrial model development, because it is an efficient way to reduce model implementation and inference costs. However, existing open source AutoML solutions do not focus much on this problem. In contrast, TabularAutoML implements three strategies of feature selection: no selection, importance cutoff selection (the default), and importance-based forward selection.

Feature importance can be estimated in two ways: split-based tree importance (Lundberg and Lee, 2017) or permutation importance (Altmann et al., 2010) of a GBM model. Importance cutoff selection aims to reject only the features that are useless for the model (importance <= 0). This strategy helps to reduce the number of features with no performance decrease, which may speed up model training and inference.
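A minimal sketch of the cutoff strategy, assuming a LightGBM classifier and its split-based importances; the helper is illustrative and not the framework's internal selector.

```python
import lightgbm as lgb
import pandas as pd

def importance_cutoff(train: pd.DataFrame, target: pd.Series) -> list:
    """Keep only features whose GBM importance is strictly positive."""
    model = lgb.LGBMClassifier(n_estimators=200)
    model.fit(train, target)
    importances = pd.Series(model.feature_importances_, index=train.columns)
    return importances[importances > 0].index.tolist()
```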

However, the user may want to limit the number of features in the final model or to find a minimal possible model, even at the cost of a small performance drop, in order to reduce inference costs significantly. For these purposes we implement a variant of the classical forward selection algorithm described in (Guyon and Elisseeff, 2003), with the key difference of ranking the candidate features by the aforementioned importance measure, which helps to significantly speed up the procedure. The specifics of the algorithm are provided in Appendix C (as Algorithm 2).

We show in Table 1 on internal datasets that it is possible to build much faster and simpler models with slightly lower scores.

Strategy Avg ROC-AUC Avg Inference Time (10k rows)
Cut off 0.804 9.1078
Forward 0.7978 5.6088
Table 1. Comparison of different selection strategies on binary bank datasets.

3.5. Hyperparameter tuning

In TabularAutoML we use different ways to tune hyper-parameters depending on what is tuned:

  • Early stopping to choose the number of iterations (trees in GBM or gradient descent steps) for all models during the training phase

  • Expert system

  • Tree-structured Parzen Estimator (TPE) for GBM models

  • Grid search

All hyperparameters are tuned in order to maximize the metric function defined by the user (or the default for the task being solved).

3.5.1. Expert system

One simple way to quickly set model hyperparameters in a satisficing (Simon, 1956) fashion is to use expert rules. TabularAutoML can initialize a "reasonably good" set of GBM hyperparameters, such as the learning rate, subsample rate, column sample rate, and depth, depending on the task and dataset size. A suboptimal parameter choice is partially compensated by adaptive selection of the number of steps with early stopping. This prevents the final model from a large score decrease compared to carefully tuned models.
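An intentionally simplified illustration of such expert rules is given below; the thresholds and parameter values are hypothetical and only convey the idea of size- and task-dependent defaults.

```python
def expert_gbm_params(n_rows: int, task: str) -> dict:
    """Rule-based initial GBM hyperparameters (illustrative values only)."""
    params = {"learning_rate": 0.05, "num_leaves": 128,
              "feature_fraction": 0.7, "bagging_fraction": 0.8}
    if n_rows < 10_000:           # small data: reduce capacity and slow down learning
        params.update({"num_leaves": 32, "learning_rate": 0.03})
    elif n_rows > 1_000_000:      # large data: larger steps to fit the time budget
        params["learning_rate"] = 0.1
    if task == "multiclass":      # multiclass often benefits from larger trees
        params["num_leaves"] = max(params["num_leaves"], 64)
    return params
```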

3.5.2. TPE and combined strategy

We introduce a mixed tuning strategy that works by default in TabularAutoML (but can be changed by the user): for each GBM framework (lightgbm and catboost) we train two types of models. The first one gets expert hyperparameters, and the second one is fine-tuned while it fits into the time budget. The TPE algorithm described in (Bergstra et al., 2011) is used for fine-tuning. This algorithm was chosen because it shows state-of-the-art results in tuning this class of models; we use the TPE implementation from the Optuna framework (Akiba et al., 2019). In the final model, both models are blended or stacked. Also, one of the models (or even both) can be dropped from the AutoML pipeline if it does not help to increase the final model performance. We compare this combined strategy with an AutoML based on the expert system only; the results are described in Section 4.2.
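The fine-tuning step can be sketched as follows, assuming a LightGBM classifier tuned with Optuna's TPE sampler under a time budget; the search space and holdout split below are illustrative and not the exact ones used by TabularAutoML.

```python
import lightgbm as lgb
import optuna
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def tune_lightgbm(X, y, time_budget=300):
    """Fine-tune LightGBM hyperparameters with TPE within a time budget."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

    def objective(trial):
        params = {
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
            "num_leaves": trial.suggest_int("num_leaves", 16, 256),
            "feature_fraction": trial.suggest_float("feature_fraction", 0.5, 1.0),
            "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        }
        model = lgb.LGBMClassifier(n_estimators=2000, **params)
        model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)],
                  callbacks=[lgb.early_stopping(50, verbose=False)])
        return roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

    study = optuna.create_study(direction="maximize",
                                sampler=optuna.samplers.TPESampler(seed=42))
    study.optimize(objective, timeout=time_budget)  # stop when the budget runs out
    return study.best_params
```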

3.5.3. Grid search

Grid search parameter tuning is used in the TabularAutoML pipeline to fine-tune the regularization parameter of the linear model in combination with:

  • Early stopping. We assume that the regularization parameter of the linear model has a single optimal point, and after reaching it we can finish the search.

  • Warm-starting the model weights between trials of regularization values (Friedman et al., 2010). This helps to speed up model training.

Both heuristics make grid search efficient for fine-tuning linear estimators.
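A minimal sketch of this procedure using scikit-learn's warm_start mechanism is given below; the regularization grid and the single-optimum early stop are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def tune_l2_strength(X_tr, y_tr, X_val, y_val):
    """Grid search over the L2 strength with warm starts and an early stop."""
    model = LogisticRegression(warm_start=True, solver="lbfgs", max_iter=1000)
    best_c, best_score = None, -np.inf
    for c in np.logspace(-3, 3, 13):      # grid over the inverse regularization strength C
        model.C = c                       # warm_start reuses the previous coefficients
        model.fit(X_tr, y_tr)
        score = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
        if score > best_score:
            best_c, best_score = c, score
        else:
            break                         # single-optimum assumption: stop the search
    return best_c, best_score
```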

3.6. Model ensembling in TabularAutoML

3.6.1. Multilevel stacking ensembles

As mentioned before, LightAutoML allows the user to build stacked ensembles of unlimited depth. A similar strategy is common for AutoML systems and is also used in (LeDell and Poirier, 2020; Erickson et al., 2020). However, in practice, building ensembles deeper than three levels shows no effect.

TabularAutoML builds two-level stacking ensembles by default for the multi-class classification task only, because it was the only case where we observed a significant and stable boost in model performance, as shown in Section 4.2. This behavior is just the default setting and can be changed by the user to perform stacking on any type of dataset.

3.6.2. Blending

The last level of a LightAutoML ensemble, regardless of its depth, may contain more than one model. To combine those models' predictions into a single AutoML output, the predictions are passed to the blending phase. Blending in LightAutoML differs from a full stacker model in the following ways. First, it is a much simpler model. Second, as a consequence, it does not require any validation scheme to tune it and control overfitting. Finally, it is able to perform model selection to simplify the ensemble and speed up the inference phase.

TabularAutoML uses weighted averaging as the blender model. The ensemble weights are estimated with a coordinate descent algorithm in order to maximize the metric function defined by the user (or the default for the task). Models with weights close to 0 are dropped from the AutoML.
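The weight search can be sketched as a simple coordinate descent with a line search over each weight; this is an illustrative implementation, not the framework's internal blender.

```python
import numpy as np

def blend_weights(preds, y, metric, n_iters=10, grid=np.linspace(0.0, 1.0, 101)):
    """Estimate non-negative blending weights by coordinate descent on `metric`."""
    w = np.ones(len(preds)) / len(preds)            # start from a uniform average
    for _ in range(n_iters):
        for i in range(len(preds)):
            best_wi, best_score = w[i], -np.inf
            for cand in grid:                       # line search along one coordinate
                trial = w.copy()
                trial[i] = cand
                if trial.sum() == 0:
                    continue
                trial /= trial.sum()                # keep the weights on the simplex
                blended = sum(t * p for t, p in zip(trial, preds))
                score = metric(y, blended)
                if score > best_score:
                    best_wi, best_score = cand, score
            w[i] = best_wi
            w /= w.sum()
    return w  # models with near-zero weight can be dropped from the ensemble
```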

3.6.3. Ensemble of AutoMLs and time utilization

As we mentioned before, one of our goals was to limit the search space of ML algorithms in order to speed up model training and inference, which is relevant for production use cases of the AutoML framework on medium and large datasets. However, this strategy does not perform well in cases such as ML competitions, when the user has a very large time budget and needs to utilize it fully in order to get the best performance regardless of the cost. Increasing the search space and brute-force search will often have an advantage here.

Typical examples are the small datasets from the OpenML benchmark that are solved by TabularAutoML much faster than the given time limit of 1 hour (Table 2). In order to solve this problem and be competitive on benchmarks and in ML competitions on small datasets, we implement a time utilization strategy that blends multiple TabularAutoML runs with slightly different settings and validation random seeds. The user may define multiple config settings and their priority order, or use the defaults. AutoMLs with the same settings and different seeds are simply averaged together, and after that the ensembles with different settings are combined by weighted averaging. This strategy shows a performance boost on OpenML tasks; see Section 4.2.

Task type Utilized Single run
Binary (9 smallest) 3268 360
Multi class (7 smallest) 2984 1201
Table 2. Average training time in seconds for smallest OpenML datasets for Utilized Tabular Preset version and Default (single run).

4. Performance of LightAutoML

4.1. Comparison with open source AutoML

In this section, we compare the performance of the TabularAutoML Preset of LightAutoML against the already existing open source solutions across various tasks and show the superior performance of our method. First, to do this comparison, we use datasets from the OpenML benchmark that is typically used to evaluate the quality of AutoML systems. The benchmark is evaluated on 35 datasets of binary and multi-class classification tasks. The full experimental description, including the framework versions, limitations, and extended results, is presented in Appendix A. The summary of the performance results of LightAutoML vis-a-vis five popular AutoML systems is presented in Table 3, where all the AutoML systems are ranked by the total number of wins across datasets.

framework Wins Avg Rank Avg Reciprocal Rank
autoweka 0 5.7879 0.1747
autogluon 0 4.2647 0.252
autosklearn 3 2.6 0.4505
tpot 6 3.8235 0.374
h2oautoml 6 2.4857 0.4833
lightautoml 20 1.9429 0.7233
Table 3. Aggregated framework comparison on OpenML.

However, the detailed comparison of frameworks in the context of dataset groups provided in Table 4 shows that LightAutoML does not work equally well on all classes of tasks. For binary classification problems with a small amount of data, LightAutoML shows average performance and loses to TPOT; moreover, it performs on par with H2O and autosklearn. The reason for this is that tasks with a small amount of data are not common in our ecosystem and were not the main impetus behind the development of LightAutoML.

Framework Small binary Small multiclass Medium binary Medium multiclass
autoweka 0.1741 0.1667 0.1783 0.1786
tpot 0.6056 0.481 0.2061 0.2333
autogluon 0.2796 0.2071 0.2606 0.2476
autosklearn 0.4519 0.3571 0.4424 0.5417
h2oautoml 0.4907 0.5595 0.4697 0.4271
lightautoml 0.4481 0.6786 0.8939 0.8375
Table 4. Average reciprocal rank for frameworks comparison by OpenML datasets groups.

Another type of dataset for comparing the different solutions is the internal datasets collected in the bank. In this study, we use 15 bank datasets for various binary classification tasks performed in our company, such as credit scoring (probability of default estimation), collection, and marketing (response probability). As the main goal of developing the LightAutoML framework was to work with our internal applications, we expected better performance of our system on the internal data. In Table 5 we show that the performance gap between LightAutoML and the other AutoML systems is significantly higher on the bank datasets than on the OpenML data. (Note that we cannot present detailed information about these internal datasets in this paper because they contain proprietary confidential information.)

framework Wins Avg Rank Avg Reciprocal Rank
autoweka 0 5.7333 0.18
autogluon 0 3.9333 0.2778
tpot 1 3.9333 0.3056
autosklearn 1 3.2667 0.3956
h2oautoml 1 2.6667 0.4244
lightautoml 12 1.4667 0.8667
Table 5. Aggregated framework comparison on bank’s proprietary datasets.

4.2. Ablation study

To estimate the impact of each TabularAutoML feature on the OpenML benchmark results, we perform an ablation study estimating the change in average reciprocal rank. We take the best existing AutoML configuration, including the time utilization strategy, combined hyperparameter tuning, auto-typing, and stacking for multi-class tasks only (Utilized best), as the baseline. First, we turn off time utilization to estimate the contribution of multi-start bagging (Default) described in Section 3.6.3. Second, we take Default and compare two-level stacking for all tasks (Stacked all) and blending of first-level models only (Single level all) to estimate the quality of the alternative ensembling methods discussed in Section 3.6.1. Third, we exclude the advanced auto-typing module presented in Section 3.2 from Default (Typing off). Finally, we replace the combined tuning strategy described in Section 3.5.2 in Default with the expert system initialization only (No finetune). The ablation study results are presented in Table 6. As Table 6 demonstrates, removing each feature decreases the LightAutoML average reciprocal rank, which shows that all these features make our framework more accurate than the others on the OpenML datasets.

Configuration Avg Reciprocal Rank Avg Rank
No finetune 0.6054 2.4118
Typing off 0.6431 2.1765
Single level all 0.65 2.2353
Stacked all 0.6672 2.2059
Default (stacked multiclass) 0.6907 2.0882
Utilized best 0.7233 1.9429
Table 6. Ablation study on OpenML.

4.3. LightAutoML vs. building models by hand

We have also used our LightAutoML system as one of the “participants” in the internal hackathon in our ecosystem, together with 433 leading data scientists in the company. The training dataset used in the competition had 300 features and 400,000 records, and the goal was to predict the churn rates. The performance metric selected for this hackathon was ROC-AUC, and the performance of the baseline used in this competition was ROC-AUC = 75.5%.

LightAutoML was represented in the hackathon by 4 participants who used it in different configurations. As Figure 2 demonstrates, LightAutoML outperformed the baseline model. Although the average performance of LightAutoML (ROC-AUC = 76.54) was better than the average performance of the top 10% of hackathon participants, i.e., those above the 90% quantile (average ROC-AUC = 76.08), the performance improvement was not statistically significant. This means that although LightAutoML significantly outperformed the average data scientist in this hackathon (including the "bottom 90%"), its performance was comparable to the top 10% of the best data scientists. Detailed results are available in the repository (https://github.com/sberbank-ai-lab/Paper-Appendix).

Figure 2. Performance Results of LightAutoML in the NextHack Competition vs. 433 Human Competitors.

5. Deploying LightAutoML

In this section, we present our experiences with developing, piloting and moving LightAutoML into production.

Deployment. Currently, the LightAutoML system has been deployed in production on five large ML platforms inside our financial services company and its ecosystem, including cloud, B2B, and B2C platforms. Furthermore, seven more divisions are currently piloting the latest version of the system. Moreover, it is also used in several automated systems and various IT services across the ecosystem. Altogether, more than 70 teams involving several hundred data scientists use LightAutoML to build ML models across the entire ecosystem. As an example, the B2C platform alone has more than 300 business problems that are solved using LightAutoML this year, resulting in a total P&L increase of 3%. Next, we present some examples of successful deployments of LightAutoML in the ecosystem and our experiences with these deployments.

Operational audit. LightAutoML has been applied to the problem of operational audit of the bank branches with the goal of detecting mistakes made by the bank's employees across the organization in the most effective and efficient manner. These mistakes are of numerous types, depending on the type of branch, its location, the type of employee who made the mistake, etc. The goal of the operational audit is to detect and correct all these mistakes, prevent their future occurrences, and minimize their consequences according to the established practices of the bank.

Figure 3. Difference in revenues of the LightAutoML and the manually developed impact estimation methods.

In this project, we focused on 60 major types of mistakes and developed one predictive model per mistake type for each of the 11 divisions of the bank, resulting in 660 LightAutoML models in total. For comparison, prior mistake detection methods were rule-based.

One of the reasons for the bank not developing prior ML-based mistake detection models was the large number of such models (660 in our case) requiring extensive resources of data scientists and a long time to produce all of them (measured in person-years), making such project infeasible. The second reason why such models have not been previously developed is that the economic effect from each model is limited. For example, Figure 3 shows the 660 operational risk models sorted on the y-axis by the difference in the economic impact between the LightAutoML models and the previously existing rule-based methods, ranging from $60,000 highest positive impact on the left to the $8,000 highest negative impact on the right of Figure 3. This example demonstrates that detecting even the most important operational mistakes resulted in limited savings, and the savings from the medium and minor mistakes were considerably lower, making development of such machine learning models economically infeasible. Nevertheless, the cumulative economic effect of building all the 660 models is very significant, bringing the bank millions of dollars in savings.

All the 660 operational audit models were developed by our LightAutoML system in 3 days over the weekend, and the whole project took 10 person-days, taking into account data preparation costs. This contrasts sharply with the cost of creating so many models manually, resulting in saving the bank millions of dollars.

Other Examples of LightAutoML Applications. The LightAutoML system has also automatically built fraud detection models that were subsequently compared against the manually developed classical ML-based fraud detection models previously deployed in the company. These models saved 40 person-days of development time and improved model performance by 6% on the F1 metric, identifying several thousand fraudulent activities and thus saving the bank millions of dollars.

Another example of successful LightAutoML deployment was a charitable donations system involving 110 different types of contributions focusing on children’s welfare. Our LightAutoML system developed a model identifying the donors over a period of two person-days and increased the number of donations by 18% and the total sum of donations by 40% just for the email channel alone when it was moved into production.

Lessons Learned. We will start with the lessons learned while developing and piloting LightAutoML for our ecosystem. First, the applications most amenable to the deployment of AutoML are those where the prediction problem is formulated precisely and correctly and does not require human insights based on deep prior knowledge of the application domain. Second, AutoML is well suited for typical supervised learning tasks, where testing works well on retrospective data. It is harder to effectively apply the current generation of AutoML systems to non-standard problems. Third, AutoML solutions are especially useful when an ML model should be built quickly with limited resources. In those cases when there is plenty of time to build a model using top data science talent working on complex problems requiring significant insights and ingenuity, non-standard solutions, and careful fine-tuning of the model parameters, humans can outperform AutoML solutions. As an example, when Google conducted the IEEE competition (Rishi, 2019), its AutoML solution outperformed 90% of the competing teams in the first two weeks. However, humans eventually managed to catch up with and outperform Google's AutoML after the first two weeks by putting extensive effort into the hackathon and constantly improving their models. Finally, the strong performance results of LightAutoML were achieved not due to certain unique breakthrough features of our system, but because of numerous incremental improvements described in Section 3 that were cohesively combined into a unified solution.

Furthermore, while moving LightAutoML into production we have learned a different group of lessons. First, independent tests of our system by data scientists in the company show that LightAutoML significantly outperformed humans on only one-third of the ML tasks actually deployed in production, which differs from the 90% figure reported in Section 4.3 for the pilot studies. This reduction of LightAutoML performance vs. humans in production environments is due to the following reasons:

  • Presence of data leaks, i.e., situations where information about the target is exposed through the features in the training data but not in the test data, and of data artifacts, such as special decimal separators or character symbols (e.g., K standing for thousands) in numeric columns, missing values for a specific feature block, etc., across hundreds of automated data storage systems.

  • Existence of an extremely small number of minority class events compared with the number of features.

  • Under the same conditions, top data scientists build many more models than less experienced ones in the same amount of time.

The last point demonstrates that, instead of replacing data scientists with AutoML systems, it is better to complement them with such systems, and we now focus on empowering our data scientists with LightAutoML in our company. In particular, we primarily use LightAutoML as a baseline generator and as a fast hypothesis testing tool. This helps our data scientists focus on the crucial parts of the model development process, such as selecting the appropriate time period of the data, formulating the target variable, choosing suitable quality metrics, identifying business constraints, and so on.

The second lesson is associated with the importance of integrating LightAutoML with the different production environments of our diverse ecosystem to implement end-to-end solutions. Although we observed a 4x to 10x reduction in model training time compared to the usual model creation process, the overall time-to-market decreased by only 30% on average for the whole model life-cycle. Furthermore, we observed that this number can improve to almost 70% in cases where continuous integration with data sources and inference environments was done, notably on our cloud platform.

In summary, we encountered several issues when piloting our LightAutoML system and moving it into production, most of them having to do with the idiosyncratic requirements of the financial services industry and the diverse ecosystem of our organization. We managed to resolve them successfully by developing the LightAutoML system to suit the needs of our ecosystem. We have also described the lessons learned while moving LightAutoML into production across a diverse class of ML applications. All this makes a strong case for developing vertical AutoML solutions for the financial services industry, and for our LightAutoML in particular.

6. Conclusion

In this paper, we present the LightAutoML system designed to satisfy the specific needs of large financial services companies and their ecosystems. We argue for the need to develop a special-purpose AutoML, as opposed to general-purpose systems such as H2O or AutoGluon, that satisfies the idiosyncratic needs of such organizations, including the ability to handle large datasets with a broad range of data types, non-stationary data, specific types of validation (including behavioral models and out-of-time validation), and rapid development of a large number of models. The proposed LightAutoML system contains several incremental improvements, including the "light-and-fast" approach to AutoML in which only GBMs and linear models are used, a novel and fast combined hyperparameter tuning method that produces strong tuning results, and advanced data preprocessing, including auto-typing, that collectively enhance the functionality of LightAutoML and helped it achieve superior performance results.

Further, we show in the paper that our LightAutoML system outperforms some of the leading general-purpose AutoML solutions in terms of the AUC-ROC and LogLoss metrics on our proprietary applications and also on the OpenML benchmarks, as well as the models manually developed by data scientists for the typical problems of importance to large financial organizations.

Finally, the proposed LightAutoML system has been deployed in production in numerous applications across the company and its ecosystem, which helped to save the organization millions of dollars in development costs, while also achieving certain capabilities that are impossible for humans to realize, such as the generation of massive amounts of ML models in record time. We have also described several important lessons that we have learned while developing and deploying the LightAutoML system at the company, including the following: (a) the "light" approach to AutoML deployed in the LightAutoML system worked surprisingly well in practice, achieving superior performance results, mainly due to the careful integration of various incremental improvements of different AutoML features properly combined into the unified LightAutoML system; (b) the realization that LightAutoML outperformed data scientists on only one third of the deployed models, as opposed to the expected 90% of the cases, due to the complexity and the "messiness" of the actually deployed cases vis-a-vis the pilot cases; (c) the realization that it is not always true that data scientists outperform the machines when preparing the data to be used for building ML models, as LightAutoML outperformed data scientists in this data preparation category in several use cases; (d) although LightAutoML significantly improved model building productivity in our organization, the number of data scientists actually increased significantly in our company over the last year, mainly due to the realization that we need many more ML models to run our business better, and there is plenty of work for both AutoML and the data scientists to achieve our business goals.

As part of the future work, we plan to develop functionality related to model distillation and to strengthen the work with NLP tasks. In particular, some applications in our organization, including e-commerce, impose additional constraints on the real-time performance of ML models, and we need to make sure that the distillation component of LightAutoML satisfies these real-time requirements. Furthermore, we plan to further enhance the NLP functionality of LightAutoML in its future releases.

References

  • A. Agrapetidou, P. Charonyktakis, P. Gogas, T. Papadimitriou, and I. Tsamardinos (2020) An automl application to forecasting bank failures. Applied Economics Letters 28 (1), pp. 1–5. Cited by: §2.
  • T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama (2019) Optuna: a next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 2623–2631. Cited by: §3.5.2.
  • A. Altmann, L. Toloşi, O. Sander, and T. Lengauer (2010) Permutation importance: a corrected feature importance measure. Bioinformatics 26 (10), pp. 1340–1347. Cited by: §3.4.
  • J. Bergstra, R. Bardenet, B. Kégl, and Y. Bengio (2011) Implementations of algorithms for hyper-parameter optimization. In NIPS Workshop on Bayesian optimization, pp. 29. Cited by: §2, §3.5.2.
  • L. Breiman (1996) Bagging predictors. Machine learning 24 (2), pp. 123–140. Cited by: §3.1.
  • V. Cerqueira, L. Torgo, and I. Mozetič (2020) Evaluating time series forecasting models: an empirical study on performance estimation methods. Machine Learning 109 (11), pp. 1997–2028. Cited by: 3rd item.
  • C. Chen, T. Tsaur, and T. Rhai (1982) The gini coefficient and negative income. Oxford Economic Papers 34 (3), pp. 473–478. Cited by: §3.2.
  • T. Chen and C. Guestrin (2016) Xgboost: a scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pp. 785–794. Cited by: §3.1.
  • P. Das, N. Ivkin, T. Bansal, L. Rouesnel, P. Gautier, Z. Karnin, L. Dirac, L. Ramakrishnan, A. Perunicic, I. Shcherbatyi, et al. (2020) Amazon sagemaker autopilot: a white box automl solution at scale. In Proceedings of the Fourth International Workshop on Data Management for End-to-End Machine Learning, pp. 1–7. Cited by: §2.
  • A. V. Dorogush, V. Ershov, and A. Gulin (2018) CatBoost: gradient boosting with categorical features support. arXiv preprint arXiv:1810.11363. Cited by: §3.1, §3.1, §3.2.
  • N. Erickson, J. Mueller, A. Shirkov, H. Zhang, P. Larroy, M. Li, and A. Smola (2020) Autogluon-tabular: robust and accurate automl for structured data. arXiv preprint arXiv:2003.06505. Cited by: Appendix A, §1, §2, §3.6.1.
  • M. Feurer, A. Klein, K. Eggensperger, J. T. Springenberg, M. Blum, and F. Hutter (2019) Auto-sklearn: efficient and robust automated machine learning. In Automated Machine Learning, Cited by: §2.
  • J. Friedman, T. Hastie, and R. Tibshirani (2010) Regularization paths for generalized linear models via coordinate descent. Journal of statistical software 33 (1), pp. 1. Cited by: 2nd item.
  • P. Gijsbers, E. LeDell, J. Thomas, S. Poirier, B. Bischl, and J. Vanschoren (2019) An open source automl benchmark. arXiv preprint arXiv:1907.00909. Note: Accepted at AutoML Workshop at ICML 2019 Cited by: §1, §3.1.
  • I. Guyon and A. Elisseeff (2003) An introduction to variable and feature selection. Journal of machine learning research 3 (Mar), pp. 1157–1182. Cited by: §2, §3.4.
  • I. Guyon, L. Sun-Hosoya, M. Boullé, H. J. Escalante, S. Escalera, Z. Liu, D. Jajetic, B. Ray, M. Saeed, M. Sebag, et al. (2019) Analysis of the automl challenge series 2015–2018. In Automated Machine Learning, pp. 177–219. Cited by: §1, §2.
  • M. Hulsebos, K. Hu, M. Bakker, E. Zgraggen, A. Satyanarayan, T. Kraska, Ç. Demiralp, and C. Hidalgo (2019) Sherlock: a deep learning approach to semantic data type detection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1500–1508. Cited by: §3.2.
  • H. Jin, Q. Song, and X. Hu (2019) Auto-keras: an efficient neural architecture search system. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1946–1956. Cited by: §1, §2.
  • G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T. Liu (2017) Lightgbm: a highly efficient gradient boosting decision tree. In Advances in neural information processing systems, pp. 3146–3154. Cited by: §3.1, §3.1.
  • R. D. King, C. Feng, and A. Sutherland (1995) Statlog: comparison of classification algorithms on large real-world problems. Applied Artificial Intelligence an International Journal 9 (3), pp. 289–333. Cited by: §2.
  • R. Kohavi and G. H. John (1995) Automatic parameter selection by minimizing estimated error. In Machine Learning Proceedings 1995, pp. 304–312. Cited by: §2.
  • L. Kotthoff, C. Thornton, H. H. Hoos, F. Hutter, and K. Leyton-Brown (2017) Auto-weka 2.0: automatic model selection and hyperparameter optimization in weka. The Journal of Machine Learning Research 18 (1), pp. 826–830. Cited by: §1.
  • E. LeDell and S. Poirier (2020) H2o automl: scalable automatic machine learning. In 7th ICML workshop on automated machine learning, Cited by: §2, §3.6.1.
  • K. M. Lee, K. S. Hwang, K. I. Kim, S. H. Lee, and K. S. Park (2018) A deep learning model generation method for code reuse and automatic machine learning. In Proceedings of the 2018 Conference on Research in Adaptive and Convergent Systems, pp. 47–52. Cited by: §2.
  • S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. In Advances in neural information processing systems, pp. 4765–4774. Cited by: §3.4.
  • D. Micci-Barreca (2001) A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter 3 (1), pp. 27–32. Cited by: §3.2.
  • J. MSV (2018) External Links: Link Cited by: §2.
  • S. Nabar (2018) External Links: Link Cited by: §2.
  • R. S. Olson and J. H. Moore (2019) TPOT: a tree-based pipeline optimization tool for automating. Automated Machine Learning: Methods, Systems, Challenges, pp. 151. Cited by: §2, §2.
  • V. Perrone, R. Jenatton, M. Seeger, and C. Archambeau (2017) Multiple adaptive bayesian linear regression for scalable bayesian optimization with warm start. arXiv preprint arXiv:1712.02902. Cited by: §2.
  • U. Pidun, M. Reeves, and M. Schüssler (2019) Do you need a business ecosystem?. Boston Consulting Group 7. Cited by: §1.
  • D. Rishi (2019) External Links: Link Cited by: §5.
  • A. Santos, S. Castelo, C. Felix, J. P. Ono, B. Yu, S. R. Hong, C. T. Silva, E. Bertini, and J. Freire (2019) Visus: an interactive system for automatic machine learning model building and curation. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, Cited by: §2.
  • V. Shah and A. Kumar (2019) The ml data prep zoo: towards semi-automatic data preparation for ml. In Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning, pp. 1–4. Cited by: §3.2.
  • H. A. Simon (1956) Rational choice and the structure of the environment.. Psychological review 63 (2), pp. 129. Cited by: §1, §3.5.1.
  • J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter (2016) Bayesian optimization with robust bayesian neural networks. In Advances in neural information processing systems, pp. 4134–4142. Cited by: §2.
  • K. Swersky, J. Snoek, and R. P. Adams (2013) Multi-task bayesian optimization. In Advances in neural information processing systems, pp. 2004–2012. Cited by: §2.
  • J. Thomas, S. Coors, and B. Bischl (2018) Automatic gradient boosting. arXiv preprint arXiv:1807.03873. Cited by: §2.
  • C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown (2013) Auto-weka: combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 847–855. Cited by: §1, §2, §2.
  • K. Ting and I. Witten (1997) Stacking bagged and dagged models. Cited by: 3rd item.
  • C. Yang, Y. Akimoto, D. W. Kim, and M. Udell (2019) OBOE: collaborative filtering for automl model selection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Cited by: §2.
  • G. Zeng (2014) A necessary condition for a good binning algorithm in credit scoring. Applied Mathematical Sciences 8 (65), pp. 3229–3242. Cited by: §3.
  • L. Zimmer, M. Lindauer, and F. Hutter (2020) Auto-pytorch tabular: multi-fidelity metalearning for efficient and robust autodl. arXiv preprint arXiv:2006.13799. Cited by: §2.

Appendix A Experiment design and additional results

The datasets and experiment design were taken from the official OpenML benchmark page. All datasets were evaluated across 10 cross-validation folds prepared by the organizers, and the final score for each dataset was calculated by averaging the scores from all 10 folds. Models were scored by the ROC-AUC metric for binary classification tasks and by LogLoss for multi-class classification tasks. In this paper, we drop 4 of the 39 existing datasets from the evaluation, because most of the frameworks failed on those datasets due to timeouts under the given limitations.

Models were evaluated with the following limitations: a 1-hour runtime limit (the limit was passed to the framework as an input parameter if the framework supports time limitations, but the process was actually killed after 2 hours), 8 CPUs, and 32 GB RAM per single cross-validation split. Each fold was evaluated in a separate Docker container on a cloud server under Ubuntu 18.04 with HDD storage and an Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz.

The frameworks and versions compared in the benchmark are: lightautoml==0.2.8, h2o_automl==3.32.0.1, autogluon==0.0.12, autosklearn==0.11.1, autoweka==2.6, and tpot==0.11.5. All code for the benchmark evaluation was taken from the OpenML repository (https://github.com/openml/automlbenchmark), where it was published by the framework developers. In the end, we had only 4 failure cases due to timeouts. Results across all 35 datasets are shown in Table 7.

Code to reproduce our experiments is available in the repository (https://github.com/sberbank-ai-lab/automlbenchmark/tree/lightautoml). The code to reproduce the lightautoml results was also published in the OpenML repository.

Note: as mentioned in (Erickson et al., 2020), AutoGluon shows state-of-the-art results on the OpenML benchmark, but unfortunately we were not able to reproduce those results using the code published in the OpenML repository. Possible reasons are non-default run settings or differences in computational environments.

Also, during the benchmarking process we encountered various errors in the tested frameworks; most of them are listed in the Appendix in [11].

The internal datasets comparison was made in the same environment, except for the time limit: we set a limit of two hours for the internal data. The datasets contain client information that cannot be published, so only an aggregated frameworks comparison can be presented. The datasets were split independently into train/test samples by the data owners, depending on their business tasks. A split may be random, out-of-time, or separated by group values (for example, by client IDs for behavioral models); the split methods and test samples were unknown to the AutoMLs during the training phase. Models were trained on the train parts and scored by ROC-AUC values on the test parts.

Appendix B Auto-typing algorithm

Input: integer- and float-typed train features, the train target, and a set of expert rules.
Output: a Boolean flag for each feature indicating whether it should be treated as numeric.
For each candidate feature, the algorithm computes the normalized Gini of the raw values and of the candidate category encodings, together with auxiliary meta-statistics such as the number of unique values; it then applies the expert rules in order and stops as soon as one of them makes a decision for the feature.
Algorithm 1. Splitting integer and float features into categorical and numeric.
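A hedged Python sketch of this procedure, based on the description in Section 3.2, is given below. The single unique-value threshold combined with an encoding-quality comparison is a simplified stand-in for the 10 expert rules in the repository, and the Gini helper reproduces the proxy metric from Section 3.2.

```python
import numpy as np
import pandas as pd

def _norm_gini(y, pred):
    """Normalized Gini: sorting quality of `pred` relative to a perfect sort."""
    def _gini(score):
        a = np.asarray(y, dtype=float)[np.argsort(score)[::-1]]
        cum = np.cumsum(a) / a.sum()
        return cum.sum() / len(a) - (len(a) + 1) / (2 * len(a))
    return _gini(pred) / _gini(y)

def auto_type_numeric_columns(df: pd.DataFrame, target: np.ndarray) -> dict:
    """Decide, for each integer/float column, whether to treat it as numeric."""
    decisions = {}
    for col in df.columns:
        x = df[col]
        raw_quality = _norm_gini(target, x.to_numpy(dtype=float))
        freq_quality = _norm_gini(target, x.map(x.value_counts()).to_numpy(dtype=float))
        # Simplified expert rule: few unique values plus a stronger category
        # encoding suggest that the column is actually categorical.
        is_category = x.nunique() < 50 and freq_quality > raw_quality
        decisions[col] = "category" if is_category else "numeric"
    return decisions
```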

Appendix C Permutation based forward selection algorithm

Input: train features and target, validation features and target, an ML algorithm (MLAlgo), a feature block size, and a metric.
Output: the selected feature subset.
The algorithm estimates feature importances, sorts the features in descending order of importance, and traverses the ranked list in blocks of the given size. At each step it fits MLAlgo on the current selection extended by the next block, scores it on the validation set, and keeps the block only if the validation metric improves.
Algorithm 2. Importance based forward selection.
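A hedged Python sketch of Algorithm 2 follows; the `model_factory` callable, the pandas inputs, and the ROC-AUC metric are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def forward_selection(X_tr, y_tr, X_val, y_val, model_factory,
                      importances: pd.Series, block_size: int = 10) -> list:
    """Walk through features ranked by importance in blocks and keep a block
    only if it improves the validation metric."""
    ranked = importances.sort_values(ascending=False).index.tolist()
    selected, best_score = [], -np.inf
    for start in range(0, len(ranked), block_size):
        candidate = selected + ranked[start:start + block_size]
        model = model_factory()                      # fresh estimator per step
        model.fit(X_tr[candidate], y_tr)
        score = roc_auc_score(y_val, model.predict_proba(X_val[candidate])[:, 1])
        if score > best_score:                       # keep the block only on improvement
            selected, best_score = candidate, score
    return selected
```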
dataset metric lightautoml autogluon h2oautoml autosklearn autoweka tpot
australian roc-auc 0.9462 0.9393 0.934 0.9353 0.9337 0.9336
blood-transfusi… roc-auc 0.7497 0.719 0.758 0.75 0.7282 0.7479
credit-g roc-auc 0.7921 0.7766 0.7968 0.7756 0.7526 0.7824
kc1 roc-auc 0.8283 0.8168 0.8374 0.8404 0.8166 0.844
jasmine roc-auc 0.8806 0.8822 0.887 0.8826 0.8638 0.8897
kr-vs-kp roc-auc 0.9997 0.9994 0.9997 0.9999 0.981 0.9998
sylvine roc-auc 0.9882 0.9852 0.9882 0.9896 0.9729 0.9923
phoneme roc-auc 0.9655 0.9682 0.9668 0.9634 0.9552 0.9693
christine roc-auc 0.8307 0.8133 0.8247 0.8285 0.7905 0.8065
guillermo roc-auc 0.9322 0.9027 0.9078 0.9064 0.8901 0.8943
riccardo roc-auc 0.9997 0.9997 0.9997 0.9998 0.9981 0.9906
amazon_employee… roc-auc 0.9003 0.8758 0.8756 0.8524 0.8363 0.8674
nomao roc-auc 0.9976 0.9954 0.9959 0.9958 0.9826 0.9948
bank-marketing roc-auc 0.9401 0.9371 0.9373 0.938 0.8103 0.9314
adult roc-auc 0.9306 0.9286 0.9295 0.93 0.914 0.925
kddcup09_appete… roc-auc 0.8509 0.7932 0.8305 0.8383 - 0.8111
apsfailure roc-auc 0.9936 0.9915 0.9924 0.9921 0.9678 0.9904
numerai28.6 roc-auc 0.5306 0.5212 0.5311 0.5294 0.5249 0.5235
higgs roc-auc 0.8157 0.8055 0.815 0.8137 0.676 0.8024
miniboone roc-auc 0.9876 0.9842 0.9862 0.9865 0.9651 0.982
car -logloss -0.0038 -0.1337 -0.0033 -0.0023 -0.2216 -0.0001
cnae-9 -logloss -0.1555 -0.2917 -0.2002 -0.1784 -0.8589 -0.1448
connect-4 -logloss -0.3358 -0.4956 -0.3387 -0.3481 -3.1088 -0.4411
dilbert -logloss -0.0327 -0.1473 -0.0411 -0.0431 -0.2455 -0.0772
fabert -logloss -0.765 -0.7715 -0.764 -0.7705 -7.4649 -0.8263
fashion-mnist -logloss -0.2519 -0.3321 -0.3025 -0.2852 -0.6423 -0.3726
helena -logloss -2.5548 - -2.8187 -2.6359 -15.3287 -
jannis -logloss -0.6653 -0.7275 -0.7232 -0.6696 -4.1177 -0.7319
jungle_chess_2p… -logloss -0.1428 -0.381 -0.2366 -0.1599 -2.0565 -0.2829
mfeat-factors -logloss -0.0823 -0.1563 -0.0941 -0.093 -0.542 -0.1022
robert -logloss -1.3166 -1.6828 -1.6829 -1.6571 - -1.9923
segment -logloss -0.0464 -0.0854 -0.0497 -0.0615 -0.4275 -0.0522
shuttle -logloss -0.0008 -0.0008 -0.0005 -0.0005 -0.0059 -0.0006
vehicle -logloss -0.3723 -0.4812 -0.3584 -0.3816 -2.5381 -0.3745
volkert -logloss -0.8283 -0.9197 -0.8669 -0.8054 -1.7296 -0.9945
Table 7. Detailed result of evaluation on OpenML datasets.