Adaptation Strategies for Automated Machine Learning on Evolving Data

06/09/2020 ∙ by Bilge Celik, et al. ∙ 0

Automated Machine Learning (AutoML) systems have been shown to efficiently build good models for new datasets. However, it is often not clear how well they can adapt when the data evolves over time. The main goal of this study is to understand the effect of data stream challenges such as concept drift on the performance of AutoML methods, and which adaptation strategies can be employed to make them more robust. To that end, we propose 6 concept drift adaptation strategies and evaluate their effectiveness on different AutoML approaches. We do this for a variety of AutoML approaches for building machine learning pipelines, including those that leverage Bayesian optimization, genetic programming, and random search with automated stacking. These are evaluated empirically on real-world and synthetic data streams with different types of concept drift. Based on this analysis, we propose ways to develop more sophisticated and robust AutoML techniques.




1 Introduction

The field of automated machine learning (AutoML) aims to automatically design and build machine learning systems, replacing manual trial-and-error with systematic, data-driven decision making [Yao2018]. This makes robust, state-of-the-art machine learning approaches accessible to a much broader range of practitioners and domain scientists.

Although AutoML systems have been shown to be very effective and even rival human machine learning experts [Thornton2013, Feurer2015, Olson2016, H2O2019], they are usually only evaluated on static datasets [Yao2018]. However, data is often not static: it evolves. As a result, models that have been found to work well initially may become sub-optimal over time. AutoML techniques are often computationally expensive, hence frequently re-running them from scratch may not be feasible. Moreover, most AutoML techniques assume that earlier evaluations remain representative of new data, and may therefore fail to adapt when the data changes.

In online (or stream) learning, methods need to handle continually incoming data, achieve good predictive performance in a limited time window, and be able to detect and adapt to the changes in the underlying distribution, also known as concept drift [Gomes2017, Oza2001, Bifet2010]. However, there has been very limited work on automating the design of machine learning pipelines in settings with concept drift. That would require that the AutoML process, the automated search for optimal pipelines, is adapted so that it can cope with concept drift and adjust the pipelines over time, whenever the data changes so drastically that a pipeline redesign or re-tuning is warranted.

To truly understand the effects of concept drift and how to cope with it in AutoML settings, we systematically study how different AutoML techniques are affected by different types of concept drift. We discover that while these AutoML methods often perform very similarly on static datasets [Gijsbers2019b], they each respond very differently to different types of concept drift. We propose six different adaptation strategies to cope with concept drift and implement these in open-source AutoML libraries. We find that these can indeed be effectively used, paving the way towards novel AutoML techniques that are robust against evolving data.

The paper is organized as follows. First, we explain the problem more formally in Section 2. Section 3 covers related work as well as the most relevant AutoML systems and online learning techniques. Our adaptation strategies for AutoML methods are described in Section 4. Section 5 details our empirical setup to evaluate these strategies, and the results are analysed in Section 6. Section 7 concludes.

2 Problem Definition

In this section, we formalize AutoML and concept drift.

2.1 AutoML

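AutoML consists of several optimization problems that need to be solved to design a series of operations that transform raw data into desired outputs, such as a machine learning pipeline or a neural architecture [Vanschoren2018]. In this paper, we are concerned with optimizing machine learning pipelines, which can be formally defined as a combined algorithm selection and hyperparameter optimization (CASH) problem [Thornton2013]. Given a training set $X_{tr}$ with targets $y_{tr}$, and a set of algorithms $\mathcal{A} = \{A^{1}, \ldots, A^{p}\}$, each with a space of hyperparameters $\Lambda^{1}, \ldots, \Lambda^{p}$, the CASH problem is defined as finding the algorithm $A^{*}$ and hyperparameter settings $\lambda^{*}$ that minimize a loss function $\mathcal{L}$ (e.g. misclassification loss) applied to a model trained on $\{X_{tr}, y_{tr}\}$, and evaluated on a validation set $\{X_{val}, y_{val}\}$, e.g. with k-fold cross validation:

$$A^{*}_{\lambda^{*}} = \operatorname*{argmin}_{A^{j} \in \mathcal{A},\, \lambda \in \Lambda^{j}} \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\left(A^{j}_{\lambda}, \{X^{i}_{tr}, y^{i}_{tr}\}, \{X^{i}_{val}, y^{i}_{val}\}\right) \qquad (1)$$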

Instead of a single learning algorithm, $A$ can be a full pipeline including data preprocessing, feature preprocessing, meta-learning and model post-processing. This typically creates a large search space of possible pipelines and configurations thereof, resulting in important trade-offs between computational efficiency and accuracy [Yao2018].
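To make the CASH formulation concrete, the following is a minimal sketch that solves a toy instance with plain random search. The two learners (`fit_threshold`, `fit_knn`), their hyperparameter ranges, and the synthetic 1-D data are all hypothetical stand-ins chosen for illustration; none of the AutoML systems discussed in this paper work exactly this way.

```python
import random
from statistics import mean

# Two toy "algorithms" for 1-D binary classification (hypothetical stand-ins
# for real learners). Each takes training data plus hyperparameters and
# returns a predict function.
def fit_threshold(xs, ys, offset=0.0):
    # Predict 1 when x exceeds the midpoint of the class means plus an offset.
    m0 = mean(x for x, y in zip(xs, ys) if y == 0)
    m1 = mean(x for x, y in zip(xs, ys) if y == 1)
    cut = (m0 + m1) / 2 + offset
    return lambda x: int(x > cut)

def fit_knn(xs, ys, k=1):
    # Majority vote among the k nearest training points.
    def predict(x):
        nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x))[:k]
        return int(2 * sum(y for _, y in nearest) > len(nearest))
    return predict

# Joint search space: each algorithm comes with its own hyperparameter sampler.
SEARCH_SPACE = {
    fit_threshold: lambda rng: {"offset": rng.uniform(-1.0, 1.0)},
    fit_knn: lambda rng: {"k": rng.choice([1, 3, 5, 7])},
}

def cv_loss(fit, params, xs, ys, folds=5):
    # Mean misclassification rate over the folds (the averaged loss in CASH).
    losses = []
    for i in range(folds):
        val = [j for j in range(len(xs)) if j % folds == i]
        trn = [j for j in range(len(xs)) if j % folds != i]
        model = fit([xs[j] for j in trn], [ys[j] for j in trn], **params)
        losses.append(sum(model(xs[j]) != ys[j] for j in val) / len(val))
    return mean(losses)

def cash_random_search(xs, ys, n_trials=30, seed=0):
    # Sample (algorithm, hyperparameters) pairs and keep the lowest CV loss.
    rng = random.Random(seed)
    best = (float("inf"), None, None)
    for _ in range(n_trials):
        fit = rng.choice(list(SEARCH_SPACE))
        params = SEARCH_SPACE[fit](rng)
        loss = cv_loss(fit, params, xs, ys)
        if loss < best[0]:
            best = (loss, fit.__name__, params)
    return best

data_rng = random.Random(1)
ys = [i % 2 for i in range(200)]
xs = [data_rng.gauss(2.0 * y, 1.0) for y in ys]
best_loss, best_algo, best_params = cash_random_search(xs, ys)
print(best_algo, best_params, round(best_loss, 3))
```

Bayesian optimization and genetic programming replace the uniform sampling inside `cash_random_search` with a guided proposal step, but the objective being minimized has the same shape.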

2.2 Online learning and concept drift

In online learning, models are trained and used for prediction without having all training data beforehand. The data stream should be considered to be infinitely long, has to be processed in order, and only a small part of it can be in memory at any given time [Gama2014]. Usually, the data points are processed in small batches, also called windows. The process that generates the data may change over time, leading to concept drift, a usually unpredictable shift over time in the underlying distribution that generates data.

Consider a data stream $\{(X_t, y_t)\}$ generated from a joint probability density function $p(X, y)$, also called the concept that we aim to learn. Concept drift can be described as:

$$\exists X : p_{t_0}(X, y) \neq p_{t_1}(X, y)$$

where $p_{t_0}$ and $p_{t_1}$ represent joint probability functions at time $t_0$ and $t_1$, respectively [Gama2014]. Webb et al. [Webb2016] further categorize concept drift into 9 main classes based on several quantitative and qualitative characteristics, of which the duration and magnitude of the drift have the greatest impact on learner selection and adaptation. Drift duration is the amount of time in which an initial concept ($c_0$, at time $t_0$) drifts to a resulting concept ($c_1$, at time $t_1$):

$$\mathrm{Duration}(c_0, c_1) = t_1 - t_0$$

Abrupt (sudden) drift is a change in concept occurring within a small time window $\gamma$:

$$\mathrm{Duration}(c_0, c_1) \leq \gamma$$

Drift magnitude is the distance between the initial and resulting concepts over the drift period $[t_0, t_1]$:

$$\mathrm{Magnitude}(c_0, c_1) = D(c_0, c_1)$$

where $D$ is a distribution distance function that quantifies the difference between concepts at two points in time. Webb et al. [Webb2016] argue that magnitude will have a great impact on the ability of a learner to adapt to the drift. A minor abrupt drift may require refining a model, whereas major abrupt drift may require abandoning the model completely.

In gradual drift, the drift magnitude over a time period is smaller than a maximum difference $\mu$ between the concepts:

$$\forall t \in [t_0, t_1) : D(c_t, c_{t+1}) \leq \mu$$

It is crucial to understand the dynamics of concept drift, especially its effect on the search strategy used by the AutoML technique, in order to design a successful adaptation strategy. Drift detection algorithms (e.g. DDM [Gama2004]) are a vital part of these strategies. They are used to determine the existence and location of a drift to alarm the learner so that it can react in a timely manner.
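As an illustration of drift magnitude, the sketch below estimates the joint distribution of two windows of a discrete stream and measures the distance between them. The Hellinger distance is used here as one possible choice of distribution distance function; that choice, the helper names, and the toy windows are assumptions made for this example only.

```python
import math
from collections import Counter

def empirical_joint(samples):
    # Estimate P(X, y) from a list of (x, y) pairs with discrete x and y.
    counts = Counter(samples)
    total = len(samples)
    return {key: c / total for key, c in counts.items()}

def hellinger(p, q):
    # Hellinger distance between two discrete distributions, in [0, 1].
    keys = set(p) | set(q)
    s = sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys)
    return math.sqrt(s / 2)

# Initial concept: y = x. Resulting concepts: a slightly perturbed version
# (minor drift) and the fully inverted rule y = 1 - x (major drift).
window_t0 = [(0, 0), (1, 1)] * 50
window_minor = [(0, 0), (1, 1)] * 45 + [(0, 1), (1, 0)] * 5
window_major = [(0, 1), (1, 0)] * 50

c0 = empirical_joint(window_t0)
minor = hellinger(c0, empirical_joint(window_minor))
major = hellinger(c0, empirical_joint(window_major))
print(round(minor, 3), round(major, 3))
```

The minor drift yields a small distance, consistent with a model that only needs refining, while the inverted concept reaches the maximum distance of 1, the case where abandoning the model is the likelier remedy.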

3 Related Work

To the best of our knowledge, there has been very little work that aims to understand how different AutoML techniques perform on evolving data and how to adapt them. There exists interesting work on speeding up hyperparameter optimization by transfer learning from prior tasks [Golovin2017], or continual learning [de2019continual] where the goal is to adapt (deep learning) models to new tasks without catastrophic forgetting, but these do not consider concept drift in single tasks. In the online learning literature there exist many techniques for handling different types of concept drift [Gama2014], but little automated guidance on how to select the best techniques. Interesting early results were obtained using meta-learning over different streams [gama2011learning]. More recently, hyperparameter tuning techniques have been proposed which re-initiate hyperparameter tuning when drift is detected [veloso2018self, carnein2019towards]. However, these are tied to specific optimization techniques for single algorithms, while we aim to generally adapt AutoML techniques that optimize entire pipelines. There also exist strategies to adapt models previously learned by online learning algorithms [Bakirov2018] to new data, usually through ensembling, but these are not used to re-optimize the algorithms or pipelines themselves.

We do note that certain specific AutoML techniques can likely be adapted to handle concept drift better, as we will also explore in Section 4. For instance, there exists work that enables Bayesian models to detect changepoints and therefore cope with concept drift [garnett2010learning, garnett2010sequential], which could potentially be used to adapt Bayesian Optimization techniques, but to the best of our knowledge this has not been previously explored. A recent AutoML challenge focused on concept drift and found that most current AutoML systems could not adapt to concept drift in very limited time frames, being outperformed by incremental learning techniques, especially gradient boosting, with clever preprocessing.


Most similar to our work is a recent study by Madrid et al. [Madrid2019], in which a specific AutoML method (Autosklearn) is extended with concept drift detection and two model adaptation methods. Our work significantly deepens and improves upon this research. We evaluate a range of very different AutoML techniques on very different types of data streams, and analyze the empirical results to understand how these AutoML approaches are each affected by evolving data in their own way, how efficiently they handle the continuous flow of data, and what can be done to develop novel AutoML techniques more suited to evolving data. In addition, we propose and evaluate six different adaptation techniques that radically differ in their resource requirements and how they train candidate models.

In the remainder of this section, we will discuss the AutoML and online learning techniques used in this study.

3.1 AutoML techniques and systems

Bayesian Optimization (BO) is one of the most successfully used AutoML approaches in the literature [Brochu2010]. To efficiently explore the large space of possible pipeline configurations, it trains a probabilistic surrogate model on the previously evaluated configurations to predict which unseen configurations should be tried next. The trade-off in this process is between exploitation of currently promising configurations versus exploration of new regions. In Sequential Model-Based Optimization (SMBO) configurations are evaluated one by one, each time updating the surrogate model and using the updated model to find new configurations. Popular choices for the surrogate model are Gaussian Processes, shown to give better results on problems with fewer dimensions and numerical hyperparameters [Snoek2012], whereas Random Forest-based approaches are more successful in high-dimensional hyperparameter spaces of a discrete nature [Feurer2015]. Another popular BO technique uses Tree-structured Parzen Estimators (TPE), which are more amenable to parallel evaluation of configurations [Bergstra2011].

Evolutionary computation offers a very different approach. For instance, pipelines could be represented as trees and genetic programming can be used to cross-over and/or mutate the most promising pipelines to evolve them further, growing in complexity as needed [Olson2016].

Another popular technique to search the space of possible pipelines is to simply use random search. While less sample-efficient, it can be easily parallelized and/or combined with other general strategies. Such strategies include multi-fidelity optimization techniques [Vanschoren2018], which first try many configurations on small samples of the data, only evaluating the best ones on more data. Second, ensembling techniques such as voting or stacking can combine many previously trained configurations, and correct for over- or underfitting, respectively. Finally, it is often beneficial to use meta-learning to build on information gained from previous experiments, for instance to warm-start the search with the most promising configurations, or to design a smaller search space based on prior experience [Vanschoren2018]. For a wide survey of AutoML techniques, see [Vanschoren2018].

In this study, we examine the impact of concept drift on each of these approaches (BO, evolution, and random search). Since we need to implement adaptation strategies to better cope with concept drift, we select one open-source, state-of-the-art AutoML system for each of these.

3.1.1 Autosklearn

Autosklearn [Feurer2015] is an AutoML system using Bayesian optimization (SMBO). It supports warm-starting by initiating the search with pipelines that worked well on similar prior data sets, as well as a greedy ensemble selection technique to build ensembles out of the different pipelines tried. It produces pipelines consisting of a wide range of classifiers and pre-processing techniques from scikit-learn.


3.1.2 H2O AutoML

H2O AutoML performs random search in combination with automated stacked ensembles [H2O2019]. The stacking technique can be Random Forests, Gradient Boosting Machines (GBMs), Deep Neural Nets, or generalized linear models (GLMs). They are used to build a stacked ensemble containing the best of all base models, as well as one including only the best performing model from each algorithm family.

3.1.3 GAMA

GAMA uses genetic programming to automatically generate machine learning pipelines [Gijsbers2019]. It is similar to TPOT [Olson2016], but uses asynchronous rather than synchronous evolution, and includes automated ensembling as well.

3.2 Online Learning

Online adaptive learning predicts newly arriving samples with the current model, evaluates the current performance after receiving the labels for these samples, and updates (e.g. retrains) the model when needed.

3.2.1 Forgetting mechanisms

In order to adapt better to the changing environment, irrelevant past data should be forgotten. Forgetting mechanisms include sliding windows where data samples are disregarded at a constant rate, data impact decaying with a dynamic rate depending on the existence of concept drift, or removing data samples based on the class distribution. The most appropriate forgetting mechanism depends on the environment and drift characteristics. In this work we will explore strategies using both sliding windows and concept drift detection.

3.2.2 Online learning algorithms

Although many stream learning algorithms use a single learner (e.g. Hoeffding Trees [Domingos2000]), ensemble learners (e.g. SEA [Street2001], Blast [vanRijn2016]) are commonly used because of their higher accuracy and faster adaptation to concept drift. Oza’s online version of bagging [Oza2001] performs a majority vote over the base models. Leveraging Bagging [Bifet2010] adds more input and output randomization to the base models to increase ensemble diversity at the cost of computational efficiency. Blast [vanRijn2016] builds a heterogeneous set of base learners and selects the learner that performed best in the previous window to make predictions for samples in the current window. Finally, online boosting performs boosting by weighting the base learners based on their prior performance and adjusting these weights over time [Oza2005].
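The window-based selection idea described for Blast can be sketched in a few lines. This is a deliberately simplified illustration rather than the Blast algorithm itself: the two fixed predictors and the helper name `select_active_learner` are hypothetical, and real base learners would also be updated on every window.

```python
def select_active_learner(learners, prev_window):
    """Return the name of the learner with the highest accuracy on prev_window.

    learners maps a name to a predict function; prev_window is a list of (x, y).
    Ties are resolved in favour of the learner listed first.
    """
    def accuracy(predict):
        return sum(predict(x) == y for x, y in prev_window) / len(prev_window)
    return max(learners, key=lambda name: accuracy(learners[name]))

# A heterogeneous set of (fixed, toy) base learners.
learners = {
    "always_zero": lambda x: 0,
    "positive_is_one": lambda x: int(x > 0),
}

# Before drift the label follows the sign of x; after drift it is always 0.
window_before = [(x, int(x > 0)) for x in (-2, -1, 1, 2, 3)]
window_after = [(x, 0) for x in (-2, -1, 1, 2, 3)]

active_before = select_active_learner(learners, window_before)
active_after = select_active_learner(learners, window_after)
print(active_before, active_after)  # positive_is_one always_zero
```

Because the active learner is re-selected after every window, such an ensemble can switch to a better-suited base learner one window after a drift, without any explicit drift detector.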

3.2.3 Evaluation

The most common procedures to evaluate online learning algorithms are holdout, prequential, and data-chunk evaluation. Since the data in a stream must be processed in their temporal order, cross validation is not applicable [Gama2014]. In prequential evaluation (or interleaved test-then-train), newly arriving instances are first used to calculate the performance (test) and then to update the model (train) in the next iteration. Accuracy changes incrementally as new instances arrive in the data stream. Although accuracy can improve (but also fluctuate) over time, this method is expected to perform poorly at the beginning of the learning process and has been shown to yield lower accuracy than holdout on average [Rijn2014]. It is, however, very useful for evaluating models in case of concept drift.

A middle-way approach that combines holdout and prequential evaluation is data chunk evaluation, which applies the test-then-train paradigm to fixed-size data chunks instead of individual instances. Unlike individual prequential evaluation, this approach does not punish algorithms for early mistakes, and training time can be measured [Bifet2009]. A critical aspect is determining the size of the data chunks, which is related to the drift type and computational cost.
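The test-then-train loop over data chunks can be sketched as follows. The `MajorityClass` learner and the toy stream are hypothetical and only serve to make the evaluation order visible; any learner with `fit` and `predict` could be substituted.

```python
from collections import Counter

class MajorityClass:
    """Toy learner: predicts the most frequent label of the last chunk it saw."""
    def __init__(self):
        self.label = 0

    def fit(self, chunk):
        self.label = Counter(y for _, y in chunk).most_common(1)[0][0]

    def predict(self, x):
        return self.label

def chunk_evaluate(stream, chunk_size, learner):
    # Split the stream into consecutive chunks, preserving temporal order.
    chunks = [stream[i:i + chunk_size] for i in range(0, len(stream), chunk_size)]
    learner.fit(chunks[0])  # the first chunk is used for training only
    accuracies = []
    for chunk in chunks[1:]:
        # Test on the new chunk first ...
        acc = sum(learner.predict(x) == y for x, y in chunk) / len(chunk)
        accuracies.append(acc)
        # ... and only then use it for training.
        learner.fit(chunk)
    return accuracies

# Toy stream with an abrupt drift at instance 30: the label flips from 0 to 1.
stream = [(i, 0) for i in range(30)] + [(i, 1) for i in range(30, 60)]
chunk_accuracies = chunk_evaluate(stream, 10, MajorityClass())
print(chunk_accuracies)  # [1.0, 1.0, 0.0, 1.0, 1.0]
```

The single drop to 0.0 marks the chunk in which the drift occurs; accuracy recovers on the next chunk because the learner has by then been retrained on post-drift data.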

4 Adaptation Strategies

Since AutoML methods usually optimize an objective function based on a static training set (see Eq. 1), they require several modifications to be applicable to online learning with concept drift. We call these modifications adaptation strategies because they can be generally applied to adapt AutoML methods. Madrid et al. [Madrid2019] previously introduced two such strategies: global model replacement, where the current predictive model is replaced with a new one trained on new data when drift is detected, and model management, where only the weights of the base models in Autosklearn’s final ensemble are updated based on their performance. The latter is specific to AutoML systems that build ensembles.

4.1 Strategy definitions

In this paper, we propose and evaluate six different adaptation strategies, visualized in Fig. 1. In each of the following strategies, the AutoML algorithm is run at least once at the beginning with the initial batch of data. For the forgetting mechanism we use a fixed length sliding window, where the last s batches are fed into training the learner and the rest of the data is forgotten. The hyperparameter s can be chosen depending on the application: lower values imply lower memory requirements and faster adaptation to concept drift, while higher values imply larger training sets and less frequent re-optimizations of the pipelines.
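The fixed-length sliding window used as the forgetting mechanism can be sketched with a bounded queue. The class name `SlidingWindow` and the toy batches are illustrative assumptions; only the behaviour (keep the last s batches, forget the rest) is taken from the description above.

```python
from collections import deque

class SlidingWindow:
    """Keeps only the last s batches; older batches are forgotten automatically."""
    def __init__(self, s):
        self.batches = deque(maxlen=s)

    def add(self, batch):
        # Appending to a full deque silently drops the oldest batch.
        self.batches.append(batch)

    def training_data(self):
        # Concatenate the retained batches, oldest first.
        return [row for batch in self.batches for row in batch]

window = SlidingWindow(s=3)
for b in range(5):
    # Each toy batch holds two rows tagged with the batch index b.
    window.add([(b, i) for i in range(2)])

print(window.training_data())
# [(2, 0), (2, 1), (3, 0), (3, 1), (4, 0), (4, 1)]
```

Memory use is bounded by s times the batch size, which is the trade-off the hyperparameter s controls.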

Fig. 1: Adaptation strategies
AS-0 Detect & Increment

This strategy includes a drift detector that observes changes in performance measures and uses this to detect concept drift. AutoML is run on the first batch of data to build a first model (Model-A). However, we limit the AutoML search space to pipelines with incremental learning algorithms, i.e. learners that can continue to train on new data, such as gradient boosting. As long as there is no drift in the data, Model-A is used to predict all upcoming batches. However, when drift is detected, the pipelines are trained incrementally with the latest s (sliding window) batches. The configuration of the pipelines remains the same. If the AutoML classifier is an ensemble, the base pipelines are each trained incrementally and their weights in the ensemble are kept to their original values. This strategy assumes that the initial pipeline configuration will remain useful and only the models need to be updated in case their performance dwindles because of concept drift. It tests whether keeping a learner memory of past data is beneficial, since this can give a performance boost if the learner can adapt to the new data. Training incrementally is usually faster than training from scratch.

AS-1 Detect & Retrain

This strategy is similar, but runs AutoML without restricting the search space and retrains the pipelines from scratch on the latest s batches when drift is detected. The original pipeline configurations and ensemble weights are kept. This strategy also assumes that the initially found pipeline configuration will remain useful and only the models need to be retrained in case of concept drift. It eliminates the need for rerunning the (expensive) AutoML techniques with every drift, but may be less useful if the drift is so large that a pipeline redesign is needed.

AS-2 Detect & Warm-start

After drift is detected, the AutoML technique is re-run to find a better pipeline configuration. Instead of starting from scratch, however, the AutoML is rerun with a ’warm start’, i.e. starting the search from the best earlier evaluated pipeline configurations. This strategy assumes that the initial pipeline configurations are no longer optimal after concept drift and should be re-tuned. Using a warm start mechanism can lead to faster convergence to good configurations, and hence work better under small time budgets. However, in case of sudden drift the previously best pipelines may mislead the search.

AS-3 Detect & Restart

After drift is detected, the AutoML technique is re-run from scratch, with a fixed time budget, to find a better pipeline configuration. This is identical to global model replacement in [Madrid2019] or the hyperparameter optimization in [veloso2018self, carnein2019towards]. This strategy also assumes that the pipelines need to be re-tuned after drift. Rerunning the AutoML from scratch is more expensive, but could result in significant performance improvements in case of significant drift.

AS-4 Periodic Restart

Similar to ’Detect & Restart’, except that instead of using a drift detector, the AutoML method is restarted with each new batch, or at some other regular interval. This strategy tests whether a drift detector is actually necessary, and whether it is worth retraining at fixed intervals in spite of the significant computational cost.

AS-5 Train once

This is a baseline strategy added for comparison purposes. AutoML is run once at the beginning and the resulting model is used to test each upcoming batch. The advantages are simplicity of the design and time savings. This tests the innate capabilities of AutoML methods to cope with concept drift.
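The six strategies differ mainly in what happens after each batch has been tested. The sketch below reduces that decision to a lookup; the action labels are hypothetical names for the operations described above, and Periodic Restart is simplified to "restart after every batch".

```python
from enum import Enum

class Strategy(Enum):
    DETECT_INCREMENT = 0   # AS-0
    DETECT_RETRAIN = 1     # AS-1
    DETECT_WARM_START = 2  # AS-2
    DETECT_RESTART = 3     # AS-3
    PERIODIC_RESTART = 4   # AS-4
    TRAIN_ONCE = 5         # AS-5

# What each detector-based strategy does once drift has been signalled.
ON_DRIFT = {
    Strategy.DETECT_INCREMENT: "train_incrementally",
    Strategy.DETECT_RETRAIN: "retrain_same_pipelines",
    Strategy.DETECT_WARM_START: "rerun_automl_warm_start",
    Strategy.DETECT_RESTART: "rerun_automl_from_scratch",
}

def choose_action(strategy, drift_detected):
    """Map a strategy and the drift signal to the action taken after a batch."""
    if strategy is Strategy.TRAIN_ONCE:
        return "keep_model"                 # baseline: never adapt
    if strategy is Strategy.PERIODIC_RESTART:
        return "rerun_automl_from_scratch"  # no detector, restart every batch
    return ON_DRIFT[strategy] if drift_detected else "keep_model"

def run_stream(strategy, drift_flags):
    # drift_flags holds one boolean per tested batch (the detector's output).
    return [choose_action(strategy, flag) for flag in drift_flags]

flags = [False, True, False]
print(run_stream(Strategy.DETECT_RETRAIN, flags))
# ['keep_model', 'retrain_same_pipelines', 'keep_model']
print(run_stream(Strategy.PERIODIC_RESTART, flags))
# ['rerun_automl_from_scratch', 'rerun_automl_from_scratch', 'rerun_automl_from_scratch']
```

Seen this way, AS-0 to AS-3 share the same trigger and differ only in how much of the earlier search they reuse, while AS-4 and AS-5 ignore the drift signal entirely.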

4.2 Integration in AutoML Systems

To evaluate the utility of these adaptation techniques, we have implemented them in Autosklearn, H2O and GAMA, representing AutoML approaches based on Bayesian optimization, random search, and evolution, respectively.

Drift detection. The first four adaptation strategies require a drift detector. We apply the Early Drift Detection Method (EDDM) [BaenaGarcia2006], an improvement over the widely used DDM method [Gama2004] that better detects gradual drift. EDDM assumes that concept drift occurs when the learner’s misclassifications are spaced closer together over time. After every misclassified sample $i$, it computes the average distance between misclassifications $p'_i$ and the standard deviation $s'_i$ in the current sequence (or batch). It also records the value $p'_i + 2 s'_i$, the 95% point of the distribution, and its peak value $p'_{max} + 2 s'_{max}$. Drift occurs when misclassifications are spaced more closely than usual:

$$\frac{p'_i + 2 s'_i}{p'_{max} + 2 s'_{max}} < \alpha$$

The threshold $\alpha$ is suggested to be set to 0.95.
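The detection rule can be sketched as a small streaming class. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: it uses the single threshold of 0.95 given above, and the `min_errors=30` warm-up before any alarm is an assumption added here so that early estimates do not trigger false alarms.

```python
import math

class SimpleEDDM:
    """Minimal EDDM-style detector with a single threshold (illustrative sketch)."""
    def __init__(self, alpha=0.95, min_errors=30):
        self.alpha = alpha
        self.min_errors = min_errors  # assumed warm-up before alarms are allowed
        self.pos = 0                  # number of samples seen so far
        self.last_error_pos = 0
        self.n_errors = 0
        self.mean = 0.0               # running mean distance between errors (p')
        self.m2 = 0.0                 # running sum of squared deviations
        self.peak = 0.0               # largest p' + 2 s' observed so far

    def update(self, misclassified):
        """Feed one prediction outcome; return True when drift is signalled."""
        self.pos += 1
        if not misclassified:
            return False
        distance = self.pos - self.last_error_pos
        self.last_error_pos = self.pos
        # Welford update of the mean and variance of the error distances.
        self.n_errors += 1
        delta = distance - self.mean
        self.mean += delta / self.n_errors
        self.m2 += delta * (distance - self.mean)
        level = self.mean + 2 * math.sqrt(self.m2 / self.n_errors)
        if level > self.peak:
            self.peak = level
            return False
        return self.n_errors >= self.min_errors and level / self.peak < self.alpha

detector = SimpleEDDM()
# Stable phase: one error every 10 samples.
stable_alarms = [detector.update(t % 10 == 0) for t in range(1, 401)]
# Drift phase: errors become five times as frequent.
drift_alarms = [detector.update(t % 2 == 0) for t in range(1, 201)]
print(any(stable_alarms), any(drift_alarms))  # False True
```

Note that the alarm does not fire immediately after the error rate changes: the standard deviation first grows, which temporarily raises the tracked level, and the ratio only falls below the threshold once enough closely spaced errors have accumulated.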

Strategy 0: Detect & Increment
H2O adaptation:
- Restrict the base models to GBM and RF, tune the hyperparameters with RandomGridSearch for the space defined in h2o.automl settings.
- Train a StackedEnsemble on the best pipelines using the initial batch.
- When drift is detected, use the checkpoint option to train the base models incrementally on new data with the same hyperparameters.
- Retrain the stacked ensemble with the updated models on the latest data, with the same hyperparameters.
Autosklearn adaptation:
- Restrict the search space to GBM and RF for classifiers with option warm-start = True and fit AutoML on the initial batch.
- Use the Refit function to retrain the pipelines retrieved from the current ensemble on the latest data.
GAMA adaptation:
- Restrict the search space to GBM and RF for classifiers with option warm-start = True and fit AutoML on the initial batch.
- Fit the pipelines in the ensemble to the latest data by retraining the model (automl.model) retrieved from the previous AutoML run. Use the same hyperparameters.

Strategy 1: Detect & Retrain
H2O adaptation:
- Run AutoML on the initial batch.
- When drift is detected, retrain the best model (leader). If this is a stacked ensemble, get the base models (with get_params()[’base_models’]) and retrain each of them on the new batches.
- Retrain the stacked ensemble with the updated models on the new data.
Autosklearn adaptation:
- Run AutoML on the initial batch.
- Use the refit function to fit the pipelines in the ensemble to the latest data by retraining the ensemble model retrieved from the initial AutoML run.
GAMA adaptation:
- Run AutoML on the initial batch.
- Fit the pipelines in the ensemble on the latest data by retraining the model (automl.model) retrieved from the initial AutoML run.

Strategy 2: Detect & Warm-start
H2O adaptation:
- Warm-starting would mean retrieving the best pipelines from the previous RandomGridSearch, re-evaluating them first in the next RandomGridSearch, and then retraining the StackedEnsemble on the best pipelines. Sadly this could not be implemented or simulated in the current system.
Autosklearn adaptation:
- Use option parallel usage with manual process spawning to create a shared output directory between batches. Set Ensemble_size to 0 to separate ensemble building and model training.
- Warm-start the Bayesian optimization with the best previous pipelines. Keep 1/3 of the budget for building the ensemble (fit_ensemble). Ensemble_size and ensemble_nbest parameters are kept at their default values.
GAMA adaptation:
- Use the warm start feature of the classifier fit function. This causes the evolutionary search algorithm to start from the best previous pipelines.

Strategy 3: Detect & Restart
All systems: Use the drift detector after each tested batch. Retrain AutoML from scratch with a fixed time budget on the new data when drift is detected.

Strategy 4: Periodic Restart
All systems: Re-run the AutoML after every batch with a fixed time budget.

Strategy 5: Train once
All systems: No change. Only run the AutoML methods on the first batch.

TABLE I: Implementation details for the AutoML method adaptations.

Adapting AutoML techniques. The remaining non-trivial adaptations required for each system are summarized in Table I. First of all, none of the AutoML methods come with a built-in test-then-train procedure. We therefore perform data chunk evaluation by dividing the data into n batches and feeding them one by one, in order of arrival, to the method, first for prediction/testing and then for training. None of the methods cover online learning algorithms. For Detect & Increment we restricted the search space to gradient boosting (GBM) and random forest (RF) models, which are supported by all methods and allow incremental learning. Detect & Warm-start could sadly not be implemented or simulated in H2O in its current form.

5 Experimental Setup

We describe the data streams used to evaluate the adaptation mechanisms and the setup of the AutoML systems. To ensure reproducibility, all these data streams are publicly available at OpenML [Vanschoren2014] together with results of several experiments with different algorithms (search for datasets tagged ‘concept_drift’). We also provide a GitHub repository with our code and empirical results, including many plots that could not be fitted in this paper.

5.1 Data streams

We selected 4 well-known, real classification data streams and generated 15 artificial ones with different drift characteristics. The latter are generated using the MOA framework [Bifet2011], and include gradual, sudden and mixed (gradual & sudden) drift. Within each drift type, the drift magnitude is changed by changing underlying distribution functions, and in some cases a certain amount of artificial noise is added.

Airlines

[Gomes2017] is a time series on flight delays with drift at daily and weekly intervals [Webb2018]. It has 539,383 instances and 9 numeric and categorical features.

Electricity

[Harries1999] is a time series on electricity pricing that has been shown to contain different kinds of drift [Webb2018]. It has 45,312 instances, 7 features, and 2 classes.

IMDB

[Read2012] is a text data stream with data from the Internet Movie Database. An often-used task in drift research is to predict whether a movie belongs to the "drama" genre. It has 120,919 movie plots described by 1,000 binary features.

Vehicle

[Duarte2004] includes sensor data from a wireless sensor network for moving vehicle classification. It has 98,528 instances of 50 acoustic and 50 seismic features.

Streaming Ensemble Algorithm (Sea)

[Street2001] is a data generator based on four classification functions. We created data streams of 500,000 instances and 3 numerical features with concept drift. Abrupt drift is created by changing the underlying classification function at instance 250k. The drift window $w$ is the number of instances over which the drift happens. Five abrupt drift data streams were generated by switching between different functions within a short drift window and introducing 10% label noise. We also generate mixed drift streams by additionally adding two gradual drifts (before and after the abrupt drift), each with a wider drift window. Five mixed drift data streams were generated with different magnitudes of sudden and gradual drift, without adding noise.

Rotating Hyperplane

[Hulten2001] is a stream generator that generates d-dimensional classification problems in which the prediction is defined by a rotating hyperplane. By changing the orientation and position of the hyperplane over time in a smooth manner, we can introduce smooth concept drift. We created 5 gradual drift data streams with different drift magnitudes within a window of 100k. The data streams contain 500,000 instances with 10 features. 5% noise is added by randomly changing the class labels.

The data streams are divided into batches of equal size to simulate an online environment and given to the algorithms in arriving order, one batch at a time. We choose a batch size of 1000 to provide enough data for the AutoML algorithms and minimize the effect of accuracy fluctuations caused by small batches. For AS-4, the batch size is larger (20,000 instances) since it does not include a drift detector and requires retraining with every batch, which is much more expensive. We apply data chunk evaluation to evaluate the adapted AutoML systems. For training, the fixed length of sliding window is chosen to be 3 for each data stream.

5.2 Algorithms

We evaluate the adaptations of Autosklearn, GAMA and H2O. As baselines, we added Oza Bagging [Bifet2009] and Blast [vanRijn2016]. Both are state-of-the-art ensemble methods specifically designed for data streams with concept drift. Below, we explain how each algorithm is configured. Hyperparameters that are not mentioned are used with their default setting.

Autosklearn

[Feurer2015] The memory limit for learning algorithms is increased to 12 GB to avoid memory errors. The time budget for each run is the default 1 hour.

GAMA

[Gijsbers2019] The option to build an ensemble out of the trained machine learning pipelines is activated. Maximum total time for each run is set to 1 hour.

H2O

[H2O2019] All settings are kept at their default values, including the default stacker, a Generalized Linear Model (GLM). The maximum run time is set to 1 hour.

Oza Bagging

[Oza2001] from the scikit-multiflow library. The window size is set to the batch size, 1000, and the pretrain size is also set to 1000. This way, training and prediction happen in the same way as for the AutoML methods, allowing fairer comparison.

Blast

[vanRijn2016] The hyperparameters used are the defaults in the MOA library, and the window size is set to 1000, similar to the AutoML implementation.

6 Results

We evaluate these algorithms on all data streams, and analyze the effects of drift characteristics, adaptation strategies and AutoML approaches.

6.1 Synthetic data streams

We first analyze the effect of two important drift characteristics: drift type and drift magnitude. Fig. 2-2 demonstrate the results of artificial data streams with gradual, abrupt and mixed drift types, respectively. Each subgraph shows the results of one AutoML library, and each series shows the accuracy for a specific adaptation strategy (except AS-4) after every batch. They show the results for the highest drift magnitude (for clearer analysis). Results for different levels of drift magnitude are shown in Fig. 2 for the GAMA adaptations. Also see Fig. 4 and 5: they contain the same results, but averaged over 20 batches (hence less noisy), and also including AS-4.

Fig. 2: Accuracy over time for artificial data streams: (a) HYPERPLANE - High gradual drift; (b) SEA - High abrupt drift; (c) SEA - High mixed drift; and (d) SEA - Abrupt drift for different magnitudes (1 is the lowest magnitude), for GAMA. For high gradual drift (a), H2O with AS-0 results are limited to the first 377 batches because of memory issues.

6.1.1 Effect of Drift Type: High Gradual Drift

As shown in Fig. 2(a), Oza Bagging is dominated by several AutoML adaptations. AS-1 (Detect & Retrain) and AS-3 (Detect & Restart) perform well in both GAMA and Autosklearn, but not for H2O. This may be due to the GLM stacker overfitting on the relatively small batches. AS-0 (Detect & Increment) works quite well for all AutoML techniques. AS-2 (Detect & Warm-start) works well for GAMA, but not for Autosklearn, possibly because the Bayesian surrogate model is misled by starting with configurations that are no longer good. Note that Blast is on par with the best options in GAMA and Autosklearn, and better than any option in H2O.

6.1.2 Effect of Drift Type: High Abrupt Drift

As shown in Fig. 2(b), high abrupt drift affects all AutoML models, and only some strategies manage to recover for some of the AutoML techniques. GAMA is the quickest to recover when warm-started (AS-2) or when the models are retrained (AS-1). Restarting AutoML on the new data (AS-3) works for both GAMA and Autosklearn but takes longer to recover. Incremental learning (AS-0) never fully recovers. Oza Bagging recovers immediately, yet its overall accuracy is lower than that of the AutoML strategies that do recover. Blast is equally unaffected and nearly on par with the recovered AutoML strategies.

6.1.3 Effect of Drift Type: High Mixed Drift

Fig. 2(c) shows that in the case of high mixed drift, restarting the AutoML on new data (AS-3) performs best overall. Compared to the abrupt drift data stream (Fig. 2(b)), AutoML methods handle the sudden drift better, with a smaller drop in accuracy and a faster shift to the recovery phase. One explanation is that the gradual drift before and after the sudden drift periodically triggers the drift detectors, causing more frequent retraining. Oza Bagging and Blast handle the drift well, yet overall fail to match the performance of the best AutoML strategies (AS-3 and AS-1), especially after the sudden drift point. As in Fig. 2(a), AS-1 does not work well for H2O and AS-2 does not work well for Autosklearn, likely for the same reasons.

Fig. 3: Accuracy over time for the real data streams: (a) Airlines; (b) Electricity; (c) IMDB; and (d) Vehicle.

6.1.4 Effect of drift magnitude

Fig. 2(d) shows the results of the GAMA library for different levels of drift magnitude on the SEA abrupt drift data stream. As the drift magnitude increases, the performance drop also increases, but the overall recovery period decreases. This is most likely because the drift detector is triggered earlier in high-magnitude cases. The best adaptation strategy changes with the drift magnitude, as previously suggested in [Madrid2019]. Retraining the models (AS-1), however, generally recovers faster than restarting the AutoML (AS-3). Warm-starting (AS-2) sometimes recovers faster, but not always. This may depend on exactly how the data drifts. If the concept underlying the data is still somewhat similar after the drift, i.e., it can be modelled with similar pipeline configurations, then warm-starting will speed up recovery; if not, warm-starting may be unhelpful.

6.2 Real Data Streams

The results on the real data streams are shown in Fig. 3(a)-(d). While each AutoML library fluctuates within a similar range of accuracy values, some differences can be observed between the adaptation strategies. AS-0 does not work well for the Airlines and IMDB data streams (Fig. 3(a) and 3(c)), which suggests that with real-world concept drift, pipelines should be either retuned (AS-2/3) or retrained from scratch (AS-1) after concept drift. Oza Bagging is dominated in most data streams by several AutoML options. Blast matches the performance of the AutoML methods on Airlines and IMDB, outperforms them on Electricity, but performs generally worse than Autosklearn and GAMA for Vehicle. Also on Vehicle, H2O with AS-1 performs badly, as seen earlier with high gradual drift.

6.3 Effects of adaptation strategies

Figures 4 and 5 plot the accuracy of all algorithms for artificial data streams, with the results averaged over 20 batches. This smooths the results and allows us to compare directly with AS-4. A first observation is that, irrespective of the type of drift, restarting AutoML during the stream, either when drift occurs (AS-3) or at regular intervals (AS-4), is better than not restarting at all (AS-5). However, while AS-3 and AS-4 perform well overall, they are usually only slightly better than simply retraining the models without re-optimizing them (AS-1), and AS-1 requires much less retraining time (also see Table I). AS-1 does assume that the pipeline configured on the first batch fits the characteristics of the later batches of data, and in a few cases (e.g. with abrupt drift in Fig. 4(b)) this can lead to worse performance, while AS-3 and AS-4 recover faster.

Although AS-0 (Detect & Increment) performs well with H2O and GAMA with different data streams, it is not at the level of AS-1 or AS-3, especially with abrupt drift. AS-2 (Detect & Warm-start) often works well for GAMA, but not for Autosklearn, likely because the current implementation initializes the Bayesian optimization too strongly.
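How the six strategies differ in when and how they react can be summarized in a small decision function. This is only a sketch under stated assumptions: the action names (`increment`, `retrain`, `warm_start`, `restart`, `keep`), the function `choose_action` and the `period` parameter are hypothetical, and the comments paraphrase the strategy names used in this section rather than the actual implementations.

```python
# Action taken by the drift-triggered strategies once a drift is signalled.
ACTIONS_ON_DRIFT = {
    "AS-0": "increment",   # Detect & Increment: update the current models incrementally
    "AS-1": "retrain",     # Detect & Retrain: refit the existing pipelines, no re-optimization
    "AS-2": "warm_start",  # Detect & Warm-start: re-run AutoML starting from earlier pipelines
    "AS-3": "restart",     # Detect & Restart: re-run AutoML from scratch
}


def choose_action(strategy, drift_detected, batch_index, period=1):
    """Return the (hypothetical) action a strategy takes after a batch."""
    if strategy == "AS-4":
        # Periodic Restart: ignore the detector, restart at regular intervals.
        return "restart" if batch_index % period == 0 else "keep"
    if strategy == "AS-5":
        # No adaptation: keep the initially trained model.
        return "keep"
    if strategy not in ACTIONS_ON_DRIFT:
        raise ValueError(f"unknown strategy: {strategy}")
    return ACTIONS_ON_DRIFT[strategy] if drift_detected else "keep"


for s in ["AS-0", "AS-1", "AS-2", "AS-3", "AS-4", "AS-5"]:
    print(s, choose_action(s, drift_detected=True, batch_index=7))
```

Seen this way, AS-1 and AS-3 differ only in the cost of the action they trigger, which is why their accuracy can be close while their retraining time is not.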

Fig. 4: Mean accuracy over time for artificial data streams: SEA - High abrupt drift; and SEA - High mixed drift.
Fig. 5: Mean accuracy over time for HYPERPLANE - High gradual drift

Batch size. The last strategy, AS-4 (Periodic Restart), requires a lot more training time and possibly leads to more volatile behavior, which may be a disadvantage in certain data streams. This is why we increased the size of each batch and consequently decreased the number of retrainings. For abrupt and mixed drifts (the two panels of Fig. 4, respectively), its performance is quite similar to the options that include a drift detector, and it is even slightly better and recovers faster in mixed drift after the sudden drift point. For gradual drift (Fig. 5), on the other hand, AS-4 fails to reach the level of the drift detector options in all libraries for almost every batch. This might be due to overfitting to temporary conditions occurring in continuous drift, whereas drift detectors only trigger when the drift is significant enough. With the additional computational burden of retraining with every batch and lower accuracy, this strategy seems less useful in online learning scenarios.

Drift detection. To better understand the reasons behind the performance differences, we also compared the drifts detected within each option and the resulting changes in performance. Fig. 6 shows the drift points (as vertical lines) for each of the strategies with a drift detector in GAMA for a mixed drift data stream. With strategy AS-3 (AutoML restart), many more drifts are detected and a higher accuracy is reached. This is expected, since AutoML is trained from scratch each time, which may result in bigger fluctuations in accuracy between batches and hence trigger the drift detector more often. AS-2 (Warm start) shows fewer performance fluctuations, hence fewer drifts are detected.
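To make the link between accuracy fluctuations and detector alarms concrete, the sketch below implements a simple error-rate detector in the spirit of the well-known Drift Detection Method (DDM). It is an illustration only and does not claim to be the detector used in these experiments; the class name `SimpleDDM` and its parameters `min_samples` and `drift_level` are assumptions made for this example.

```python
import math


class SimpleDDM:
    """DDM-style detector: signals drift when the running error rate rises
    well above the lowest level observed so far."""

    def __init__(self, min_samples=30, drift_level=3.0):
        self.min_samples = min_samples
        self.drift_level = drift_level
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 0.0            # running error rate
        self.p_min = math.inf   # error rate at the best point so far
        self.s_min = math.inf   # its standard deviation

    def update(self, error):
        """Feed one 0/1 prediction error; return True if drift is signalled."""
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        if self.n < self.min_samples:
            return False
        if self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()  # start monitoring the new concept from scratch
            return True
        return False


# 10% errors for 200 predictions, then every prediction is wrong.
errors = [1 if i % 10 == 0 else 0 for i in range(200)] + [1] * 100
detector = SimpleDDM()
alarms = [i for i, e in enumerate(errors) if detector.update(e)]
print(alarms[:1])
```

A model whose error rate jumps between batches crosses this kind of threshold more often, which is consistent with the observation that restarting from scratch (AS-3) yields more detected drifts than warm-starting (AS-2).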

Fig. 6: Accuracy over time with drift detection points for the SEA - High mixed data stream for each adaptation strategy.

6.4 Effects of the time budget

Fig. 7: Mean accuracy for SEA - High mixed drift data stream with different time budgets between 1 and 60 minutes for AS-3.

Fig. 7 shows how accuracy is affected if we change the time budget for the AutoML methods when restarting AutoML on drift (AS-3). Smaller budgets do affect performance, but overall most methods work well under these constraints. Table II shows the accuracy and total (train + predict) time for AS-1 (Detect & Retrain) and AS-3 (Detect & Restart) as measured for H2O as well as Blast and Oza Bagging. The timings for other AutoML methods are similar since they use the same time budgets. Blast, which also uses multiple models in an ensemble, has similar accuracy and time performance. An advantage of the AutoML methods is that the time budget can be restricted without much loss in accuracy, which allows various online learning settings.

| Method | Total time | Accuracy (mean) | Accuracy (std) |
|---|---|---|---|
| AS-1 (1 hour budget) | 3242.14 | 0.63 | 0.03 |
| AS-3 (1 hour budget) | 5271.49 | 0.60 | 0.03 |
| Blast | 4997.46 | 0.63 | 0.04 |
| Oza Bagging | 886.76 | 0.55 | 0.09 |
| AS-1 (1 min budget) | 425.57 | 0.61 | 0.03 |
| AS-3 (1 min budget) | 591.75 | 0.57 | 0.02 |

TABLE II: Time vs accuracy comparison for IMDB data and H2O.
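The trade-off in Table II can be made explicit with a few lines of arithmetic. The snippet below only re-uses the reported numbers for H2O on IMDB; the dictionary layout and the helper name `budget_tradeoff` are illustrative assumptions, and the time unit is left as reported.

```python
# (total time, mean accuracy) as reported in Table II for H2O on IMDB.
TABLE_II = {
    ("AS-1", "1 hour"): (3242.14, 0.63),
    ("AS-1", "1 min"): (425.57, 0.61),
    ("AS-3", "1 hour"): (5271.49, 0.60),
    ("AS-3", "1 min"): (591.75, 0.57),
}


def budget_tradeoff(strategy):
    """Return (time reduction factor, drop in mean accuracy) when moving
    from the 1 hour budget to the 1 minute budget."""
    t_long, acc_long = TABLE_II[(strategy, "1 hour")]
    t_short, acc_short = TABLE_II[(strategy, "1 min")]
    return t_long / t_short, acc_long - acc_short


for strategy in ("AS-1", "AS-3"):
    factor, drop = budget_tradeoff(strategy)
    print(f"{strategy}: {factor:.1f}x less total time, accuracy drop {drop:.2f}")
```

With these numbers, reducing the budget from 1 hour to 1 minute cuts the total time by a factor of roughly 7.6 for AS-1 and 8.9 for AS-3, at a cost of 0.02 and 0.03 in mean accuracy respectively.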

6.5 Comparison of AutoML systems

Fig. 8: Accuracy over time with drift detection points for the SEA - High abrupt data stream with adaptation strategy AS-3.

In this study, we experimented with 6 different adaptations of 3 AutoML systems to understand whether there exist certain combinations that are robust to different drift characteristics. On the real-world data streams, each library performs quite similarly, with accuracy changing within the same range with comparable fluctuations. For the data streams with abrupt concept drift, H2O was not as adaptive as Autosklearn and GAMA, at least not in its current form. It is possible that this can be resolved by making the stacker more adaptive (e.g. by using a GBM stacker), or by using larger batch sizes.

To understand whether such differences could result from the drift detector, Fig. 8 marks the detected drift points (in red) for each library for the high abrupt drift data stream with AS-3. Drift is always detected after the sudden drift point at batch 250. Autosklearn and GAMA recover to higher accuracy levels after the AutoML is restarted from scratch, but H2O does not. Hence, lack of drift detection is not the reason for H2O's performance after drift.

For the gradual and mixed drift data streams, Autosklearn and GAMA again have more options that adapt to the drift than H2O. This might be because the linear model (GLM) used as the metalearner in H2O's stacking process does not adapt well to changes in the data distribution. H2O is still a competitive option for gradual drift data streams when AS-0 is chosen as the adaptation strategy.

Comparing Autosklearn and GAMA, although they both have multiple strategies that adapt to abrupt drift, GAMA demonstrates slightly faster recovery (Fig. 2). This difference in recovery speed is bigger when the drift magnitude gets lower. For gradual and mixed drifts, they perform quite similarly and adapt well.

7 Conclusions

We evaluated the performance of some of the most well-known AutoML approaches in scenarios with evolving data streams. The main goal of this study was to gain a deeper understanding of how current AutoML methods are affected by different types of concept drift, and how they could be adapted to become more robust. Different adaptation strategies were designed and tested for this purpose.

Experiments on several real-world data streams that exhibit concept drift confirm that adding change detection to AutoML methods brings the performance to the level of popular online learning techniques and can outperform Oza Bagging on several occasions. The comparisons between different AutoML techniques show that both Bayesian optimization and evolutionary approaches can handle concept drift well, given an appropriate adaptation strategy and forgetting mechanism. Similar to previous studies, this study confirms that different drift characteristics affect learning algorithms in different ways, and that different adaptation strategies may be needed to optimally deal with them.

As such, it is difficult to recommend one superior strategy. However, experiments on artificial concept drift data highlight that re-running AutoML methods when concept drift is detected is beneficial. The additional computational time can be mitigated by decreasing the batch size and consequently the required training time, by decreasing the time budget for optimization, or by applying a warm start or best model adaptation strategy. That said, in many cases simply retraining the pipelines on the latest data without re-tuning them also proved very effective.

In all, this study contributes a set of promising adaptation strategies as well as an extensive empirical evaluation of these strategies, so that informed design choices can be made on how to adapt AutoML techniques to the evolving data setting. It shows that there is ample room to improve existing AutoML systems, and even to design entirely new AutoML systems that naturally adapt to concept drift.

As such, there are many avenues for future work that leverage the insights gained in this study, and use them to design novel AutoML approaches that are more robust to evolving data as it occurs in the real world.


We would like to thank Erin Ledell, Matthias Feurer and Pieter Gijsbers for their advice on adapting their AutoML systems for this study.