Generic adaptation strategies for automated machine learning

12/27/2018 ∙ by Rashid Bakirov, et al. ∙ Bournemouth University

Automation of machine learning model development is increasingly becoming an established research area. While automated model selection and automated data pre-processing have been studied in depth, there is, however, a gap concerning automated model adaptation strategies when multiple strategies are available. Manually developing an adaptation strategy, including the estimation of relevant parameters, can be time consuming and costly. In this paper we address this issue by proposing generic adaptation strategies based on approaches from earlier works. Experimental results after using the proposed strategies with three adaptive algorithms on 36 datasets confirm their viability. These strategies often achieve performance better than or comparable to custom adaptation strategies and to naive methods such as repeatedly using only one adaptive mechanism.


1 Introduction

Automated model selection has long been studied (Wasserman, 2000) with some notable recent advances (Hutter et al., 2011; Lloyd et al., 2014; Kotthoff et al., 2017). In addition, automated data pre-processing has also been a topic of recent interest (Feurer et al., 2015; Salvador et al., 2016). There is, however, a gap concerning the automated development of a model's adaptation strategy, which is addressed in this paper. Here we define adaptation as changes in a model's training set, parameters and structure, all designed to track changes in the underlying data generating process over time. This contrasts with model selection, which focuses on parameter estimation and the appropriate family to sample the model from. There is a dearth of research on adaptation strategies (Section 2), and the focus of this paper is on what strategy to apply at a given time, given the history of adaptations.

With the current advances in data storage, database and data transmission technologies, mining streaming data has become a critical part of many processes. Many models which are used to make predictions on streaming data are static, in the sense that they do not learn on current data and hence remain unchanged. However, there exists a class of models, online learning models, which are capable of adding observations from the stream to their training sets. Even though these models utilise the data as it arrives, situations can still arise where the underlying assumptions of the model no longer hold. We call such settings dynamic environments, where changes in data distribution (Zliobaite, 2011), changes in features' relevance (Fern and Givan, 2000) and non-symmetrical noise levels (Schmidt and Lipson, 2007) are common. These phenomena are sometimes called concept drift. It has been shown that changes in the environment which are not reflected in the model contribute to the deterioration of the model's accuracy over time (Schlimmer and Granger, 1986; Street and Kim, 2001; Klinkenberg, 2004; Kolter and Maloof, 2007). This requires constant manual retraining and readjustment of the models, which is often expensive, time consuming and in some cases impossible, for example when the historical data is not available any more. Various approaches have been proposed to tackle this issue by making the model adapt itself to possible changes in the environment while avoiding its complete retraining.

In this paper we aim to bridge the identified gap between automation and adaptation of machine learning algorithms. Typically there are several possible ways, or adaptive mechanisms (AMs), to adapt a given model. In this scenario, the adaptation is achieved by deploying one of multiple AMs, which changes the state of the existing model. Note that, as explained in Section 3.1, this formulation also includes algorithms with a single adaptive mechanism, provided there is a possibility of not deploying it. This applies to most, if not all, adaptive machine learning methods. A sequential adaptation framework proposed in earlier works (Bakirov et al., 2017a) separates adaptation from prediction, thus enabling flexible deployment orders of AMs. We call these orders adaptation strategies. Generic adaptation strategies developed according to this framework can be applied to any set of adaptive mechanisms for various machine learning algorithms. This removes the need to design custom adaptation strategies, resulting in the automation of the adaptation process. In this work we empirically show the viability of the generic adaptation strategies based upon techniques shown in Bakirov et al. (2015, 2016), specifically a cross-validatory adaptation strategy with the optional use of retrospective model correction.

We focus on the batch prediction scenario, where data arrives in large segments called batches. This is a common industrial scenario, especially in the chemical, microelectronics and pharmaceutical areas (Cinar et al., 2003). For the experiments we use the Simple Adaptive Batch Local Ensemble (SABLE) (Bakirov et al., 2015), a regression algorithm which uses an ensemble of locally weighted experts to make predictions, and batch versions of two popular online learning algorithms, the Dynamic Weighted Majority (DWM) (Kolter and Maloof, 2007) and the Paired Learner (PL) (Bach and Maloof, 2008). The use of these three algorithms allows us to explore different types of online learning methods: a local experts ensemble for regression in SABLE, a global experts ensemble for classification in DWM and switching between two learners in PL.

After large-scale experimentation with five regression and 31 classification datasets, the main finding of this work is that in our settings, the proposed generic adaptation strategies usually show accuracy better than or comparable to both the repeated deployment of a single AM (up to 25% improvement) and the custom adaptation strategies (up to 15% improvement). Thus, they are feasible to use for adaptation purposes, while saving the time and effort spent on designing custom strategies.

The novel aspects of this work in comparison to our previously published research in Bakirov et al. (2016, 2017a) include the following:

  • The introduction of the novel concept of generic adaptation strategies for the automation of continuously learning predictive models,

  • An extended description and formalisation of the sequential adaptation framework,

  • Significantly extended experimentation (from 3 datasets in previous works to 36 in the current one),

  • Consideration of the classification problem in addition to regression,

  • Modification of DWM and PL for the batch prediction scenario.

The paper is structured as follows: related work is presented in Section 2; Section 3 presents a mathematical formulation of the framework of a system with multiple adaptive elements in the batch streaming scenario. Section 4 introduces the algorithms which were used for the experimentation, including descriptions of the adaptive mechanisms which form the adaptive part of each algorithm and their custom adaptation strategies. The experimental methodology, the datasets on which experiments were performed and the results are given in Section 5. We conclude with our final remarks in Section 6.

2 Related Work

Adapting machine learning models is an essential strategy for automatically dealing with changes in an underlying data distribution, so as to avoid training a new model manually. Modern machine learning methods typically contain a complex set of elements allowing many possible adaptation mechanisms. This can increase the flexibility of such methods and broaden their applicability to various settings. However, the existence of multiple AMs also increases the decision space with regards to the adaptation choices and parameters, ultimately increasing the complexity of the adaptation strategy. A possible hierarchy of AMs is presented in Figure 1 (Bakirov et al., 2017b). (Hierarchy is meant here in the sense that the application of an adaptive mechanism at a higher level requires the application of an adaptive mechanism at a lower level.)

Figure 1: General adaptation scheme.

In a streaming data setting, it can be beneficial for accuracy to include recent data in the training set of the predictive models. However, retraining a model from scratch is often inefficient, particularly in high throughput scenarios, or even impossible when the historical data is no longer available. For these cases, the solution is updating the model using only the available recent data. This can be done implicitly by some algorithms, e.g. Naive Bayes. For methods which do not support an implicit update of models, online versions have often been developed, e.g. online Least Squares Estimation (Jang et al., 1997), online boosting and bagging (Oza and Russell, 2001), etc. Additionally, for non-stationary data, it becomes important to not only select a training set of sufficient size but also one which is relevant to the current data. This is often achieved by a moving window (Widmer and Kubat, 1996; Klinkenberg, 2004; Zliobaite and Kuncheva, 2010) or decay approaches (Joe Qin, 1998; Klinkenberg and Joachims, 2000). Moving window approaches limit the training data for the predictive model to the instances seen most recently. The window size is the only parameter, and is critical in controlling how fast the adaptation is performed: smaller window sizes facilitate faster adaptation but make models less stable and more susceptible to noise. The window size is usually fixed, but dynamic approaches have been proposed (Widmer and Kubat, 1996; Klinkenberg, 2004). As opposed to the binary (0/1) weighting of data resulting from moving window approaches, a continuously decreasing weight can be applied. A simple approach is to use a single decay factor, the repeated use of which leads to an exponential reduction of the data's weight. Decay can be based not only on the time of an instance's arrival (Joe Qin, 1998; Klinkenberg and Joachims, 2000), but also on similarity to the current data (Tsymbal et al., 2008), a combination thereof (Zliobaite, 2011), the density of the input data region (Salganicoff, 1993b) or consistency with new concepts (Salganicoff, 1993a).
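As an illustration of the two weighting families discussed above, the following sketch (ours, not from the paper; the function names and the example decay factor are assumptions) contrasts binary moving-window weights with exponentially decaying weights:

```python
import numpy as np

def window_weights(n, window):
    # Binary (0/1) weighting: only the `window` most recent of n instances count.
    w = np.zeros(n)
    w[max(0, n - window):] = 1.0
    return w

def decay_weights(n, decay=0.9):
    # Exponential decay: instance i (0 = oldest) gets weight decay**(n-1-i),
    # i.e. each repeated application of the factor shrinks older data's weight.
    return decay ** np.arange(n - 1, -1, -1)

print(window_weights(6, window=3))  # [0. 0. 0. 1. 1. 1.]
print(decay_weights(6, decay=0.5))  # [0.03125 0.0625 0.125 0.25 0.5 1.]
```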

The structure of a predictive model is a graph with the set of its components and the connections therein. Some common examples are hierarchical models (e.g. decision trees) or more complex graphs (e.g. Bayesian or neural networks). Here, the structure is not necessarily limited to the topological context: the number of rules in rule based systems or the number of experts in an ensemble can be considered part of the model's structure. Updates to the model's structure are also used for adaptation purposes, for example in decision and model trees (Domingos and Hulten, 2000; Hulten et al., 2001; Potts and Sammut, 2005; Basak, 2006; Ikonomovska et al., 2010), neural networks (Carpenter et al., 1991; Vakil-Baghmisheh and Pavešić, 2003; Ba and Frey, 2013), Bayesian networks (Friedman and Goldszmidt, 1997; Lam, 1998; Alcobé, 2004; Castillo and Gama, 2006) and ensemble methods (Stanley, 2002; Kolter and Maloof, 2007; Hazan and Seshadhri, 2009; Gomes Soares and Araújo, 2015; Bakirov et al., 2017a).

Finally, many models have parameters which determine their predictions, and so changing these parameters adapts the model. This can be seen in the adaptive version of Least Squares Estimation (Jang et al., 1997) and neural network back-propagation (Werbos, 1974), as well as in more recent methods like experience replay (Lin, 1992) and the Long Short-Term Memory (Hochreiter and Schmidhuber, 1997). For ensemble methods, expert weights are among the most important parameters and are often recalculated or updated throughout a model's run time (Littlestone and Warmuth, 1994; Kolter and Maloof, 2007; Elwell and Polikar, 2011; Kadlec and Gabrys, 2011; Bakirov et al., 2017a).

In this work we consider the possibility of using multiple different adaptive mechanisms, most often at different levels of the presented hierarchy. Many modern machine learning algorithms for streaming data explicitly include this possibility. A prominent example is the family of adaptive ensemble methods (Wang et al., 2003; Kolter and Maloof, 2007; Scholz and Klinkenberg, 2007; Bifet et al., 2009; Kadlec and Gabrys, 2010; Elwell and Polikar, 2011; Alippi et al., 2012; Souza and Araújo, 2014; Gomes Soares and Araújo, 2015; Bakirov et al., 2017a). These often feature AMs from all three levels of the hierarchy: online update of experts, changing the experts' combination weights and modification of the experts' set. Machine learning methods with multiple AMs are not limited to ensembles; they can also be based on Bayesian networks (Castillo and Gama, 2006), decision trees (Hulten et al., 2001), model trees (Ikonomovska et al., 2010), champion-challenger schemes (Nath, 2007; Bach and Maloof, 2008), etc.

Many of the adaptive mechanisms described above, such as moving windows and predictor combination, can be applied to most non-adaptive machine learning models in order to enhance them with adaptation capabilities. Thus, the modeller is able to deploy multiple AMs on a multitude of algorithms. However, the existence of multiple AMs raises the question of how they should be deployed. This includes defining the order of deployment and the adaptation parameters (e.g. decay factors, expert weight decrease factors, etc.). It should be noted that all of the aforementioned algorithms deploy AMs in a custom manner, meaning that their adaptation strategies are specific to each of them. This can make designing adaptive machine learning methods a complex enterprise and is an obstacle to the automation of machine learning model design. Kadlec and Gabrys (2009) have proposed a plug and play architecture for pre-processing, adaptation and prediction which foresees the possibility of using different adaptation methods in a modular fashion, but does not address the method of AM selection. Bakirov et al. (2015, 2016) have presented several such methods of AM selection for their adaptive algorithm, which are discussed in detail in Section 3.2. These methods can be seen as generic adaptation strategies which, as opposed to custom adaptation strategies, apply a single adaptation strategy to any adaptive machine learning method with multiple AMs. This allows easier automated adaptation, as the designer does not need to construct a specific custom adaptation algorithm for each machine learning method.

3 Formulation

As adaptation mechanisms can affect several elements of a model and can depend on performance several time steps back, it is necessary to make their operation concrete via a framework, to avoid confusion. We assume that the data is generated by an unknown time varying data generating process which can be formulated as:

$y_t = f_t(x_t) + \epsilon_t$   (1)

where $f_t$ is the unknown function, $\epsilon_t$ a noise term, $x_t$ is an input data instance, and $y_t$ is the observed output at time $t$. We then consider the predictive method at time $t$ as a function:

$\hat{y}_t = \hat{f}_t(x_t, \theta_t)$   (2)

where $\hat{y}_t$ is the prediction, $\hat{f}_t$ is an approximation (i.e. the model) of $f_t$, and $\theta_t$ is the associated parameter set. Our estimate, $\hat{f}_t$, evolves via adaptation as each batch of data arrives, as is now explained.

3.1 Adaptation

In the batch streaming scenario considered in this paper, data arrives in batches $X_k = \{x_{t_k}, \dots, x_{t_k + n_k - 1}\}$, where $t_k$ is the start time of the $k$-th batch and $n_k$ is its size, so that $t_{k+1} = t_k + n_k$. It then becomes more convenient to index the model by the batch number $k$, denoting the inputs as $X_k$ and the outputs as $Y_k$. We examine the case where the prediction function is static within the $k$-th batch. (A batch typically represents a meaningful real-world segmentation of the data, for example a plant run, and so our adaptation attempts to track run-to-run changes in the process. We also found in our experiments that adapting within a batch can be detrimental, as it leads to drift in the models.)

We denote the a priori predictive function at batch $k$ as $\hat{f}^-_k$, and the a posteriori predictive function, i.e. the adapted function given the observed output, as $\hat{f}^+_k$. An adaptive mechanism, $g$, may thus formally be defined as an operator which generates an updated prediction function based on the batch and other optional inputs. This can be written as:

$\hat{f}^+_k = g(\hat{f}^-_k, X_k, Y_k, \gamma)$   (3)

or alternatively as $g(\hat{f}^-_k)$ for conciseness. Note that $X_k$ and $Y_k$ are optional arguments and $\gamma$ is the set of parameters of $g$. The function is propagated into the next batch as $\hat{f}^-_{k+1} = \hat{f}^+_k$, and predictions themselves are always made using the a priori function $\hat{f}^-_k$.
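To make the separation of prediction and adaptation concrete, the following minimal sketch (our illustration; the class and function names are assumptions, not the authors' code) shows a model predicting with its a priori function and then being propagated through a chosen AM:

```python
class AdaptiveMechanism:
    """An operator g mapping an a priori model f_k^- (plus the optional batch
    X_k, Y_k and parameter set gamma) to an a posteriori model f_k^+."""
    def __call__(self, model, X=None, Y=None):
        raise NotImplementedError

def run_stream(model, batches, choose_am):
    """Predict on each batch with the a priori model, then adapt;
    the a posteriori model is propagated: f_{k+1}^- = f_k^+."""
    predictions = []
    for X, Y in batches:
        predictions.append(model.predict(X))  # always the a priori function
        g = choose_am(model, X, Y)            # an adaptation strategy picks g_k
        model = g(model, X, Y)                # f_k^+ becomes f_{k+1}^-
    return predictions
```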

We examine the situation where a choice of multiple different AMs, $G = \{g^{(1)}, \dots, g^{(m)}\}$, is available. Any AM can be deployed on each batch, where $g_k$ denotes the AM deployed at batch $k$. As the history of all adaptations up to the current batch, $g_1, \dots, g_{k-1}$, has in essence created $\hat{f}^-_k$, we call that sequence an adaptation sequence. Note that we also include the option of applying no adaptation, denoted by $g^{(0)}$; thus any adaptive algorithm fits our framework, as long as there is an option of not adapting. In this formulation, only one element of $G$ is applied to each batch of data. Deploying multiple adaptation mechanisms on the same batch is accounted for with its own symbol in $G$. Figure 2(a) illustrates our initial formulation of adaptation.

3.2 Generic adaptation strategies

In this section we present the different strategies we examined, in order to better understand the issues surrounding flexible deployment of AMs and to assist in the choice of adaptation sequence.

At every batch $k$, an AM must be chosen to deploy on the current batch of data. To obtain a benchmark performance, we use an adaptation strategy which minimizes the error over the incoming data batch $X_{k+1}$:

$g_k = \arg\min_{g \in G} E\big(Y_{k+1},\; g(\hat{f}^-_k, X_k, Y_k)(X_{k+1})\big)$   (4)

where $E$ denotes the chosen error measure (in this paper the Mean Absolute Error (MAE) is used). Since $Y_{k+1}$ are not yet obtained, this strategy is not applicable in real life situations. Also note that this may not be the overall optimal strategy which minimizes the error over the whole dataset. While discussing the results in Section 5 we refer to this strategy as Oracle.

Given the inability to conduct the Oracle strategy, below we list the alternatives. The simplest adaptation strategy is applying the same AM to every batch (these are denoted Sequence1, Sequence2, etc. in Section 5). The scheme of this strategy is given in Figure 3(a). Note that this scheme fits the “Adaptation” box in Figure 2(a). A more common practice (see Section 2) is applying all adaptive mechanisms together, denoted as Joint in Section 5. The scheme of this strategy is given in Figure 3(b), which again fits the “Adaptation” box in Figure 2(a).

Figure 2: (a) Adaptation scheme. (b) Adaptation scheme with retrospective correction, where the primed functions represent the result of the retrospective correction. Depending on the algorithm, inputs can be optional.
Figure 3: Generic adaptation strategies: (a) Simple, (b) Joint, (c) XVSelect, (d) Retrospective correction.

As introduced in Bakirov et al. (2015), it is also possible to use $X_k$ and $Y_k$ for the choice of $g_k$. Given the observations $Y_k$, the a posteriori prediction error is $E(Y_k, \hat{f}^+_k(X_k))$. However, this is effectively an in-sample error, as $\hat{f}^+_k$ is a function of $X_k$ and $Y_k$. (As a concrete example, consider the case where $\hat{f}^+_k$ is retrained using $X_k$ and $Y_k$. In this case $X_k, Y_k$ are part of the training set, and so we risk overfitting the model if we also evaluate the goodness of fit on them.) To obtain a generalised estimate of the prediction error we apply $q$-fold cross validation (in the subsequent experiments, $q = 10$). The cross-validatory adaptation strategy (denoted as XVSelect) uses a subset (fold), $X^i_k$, of $X_k$ to adapt, i.e. to obtain $g(\hat{f}^-_k, X^i_k, Y^i_k)$, and the remainder, $X_k \setminus X^i_k$, is used to evaluate the adapted function. This is repeated $q$ times, resulting in 10 different error values, and the AM with the lowest average error measure is chosen. In summary:

$g_k = \arg\min_{g \in G} \tilde{E}\big(Y_k,\; g(\hat{f}^-_k)(X_k)\big)$   (5)

where $\tilde{E}$ denotes the cross validated error, i.e. the error averaged over the $q$ held-out folds. The scheme of XVSelect is given in Figure 3(c).
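A minimal sketch of this selection procedure, under our assumptions about the AM interface (each AM is a callable g(model, X, Y) returning an adapted model, and X, Y are numpy arrays; none of these names come from the authors' code), could look as follows:

```python
import copy
import numpy as np
from sklearn.model_selection import KFold

def xv_select(model, ams, X, Y, q=10,
              error=lambda y, p: np.mean(np.abs(y - p))):  # MAE, as in the paper
    # For each candidate AM: adapt on one fold of the current batch,
    # evaluate on the remainder, and average the error over the q folds.
    q = max(2, min(q, len(X)))  # guard against batches smaller than q
    scores = []
    for g in ams:
        fold_errors = []
        for eval_idx, adapt_idx in KFold(n_splits=q).split(X):
            adapted = g(copy.deepcopy(model), X[adapt_idx], Y[adapt_idx])
            fold_errors.append(error(Y[eval_idx], adapted.predict(X[eval_idx])))
        scores.append(np.mean(fold_errors))
    return ams[int(np.argmin(scores))]  # the AM with the lowest average error
```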

The next strategy can be used in combination with any of the above strategies, as it focuses on the history of the adaptation sequence and retrospectively adapts two steps back. This is called retrospective model correction (Bakirov et al., 2016). Specifically, we estimate which adaptation at batch $k-1$ would have produced the best estimate on batch $k$:

$g'_{k-1} = \arg\min_{g \in G} E\big(Y_k,\; g(\hat{f}^-_{k-1}, X_{k-1}, Y_{k-1})(X_k)\big)$   (6)

Using the cross-validated error measure in Equation 6 is not necessary, because the corrected function is evaluated on data it is not a function of. Also note that retrospective correction does not in itself produce a $g_k$, and so cannot be used for prediction unless it is combined with another strategy (e.g. with XVSelect, giving XVSelectRC). This strategy can be extended to consider a sequence of $\tau$ AMs while choosing the optimal state for the current batch, which we call $\tau$-step retrospective correction:

$(g'_{k-\tau}, \dots, g'_{k-1}) = \arg\min_{(g^1, \dots, g^\tau) \in G^\tau} E\big(Y_k,\; (g^\tau \circ \dots \circ g^1)(\hat{f}^-_{k-\tau})(X_k)\big)$   (7)

The scheme for XVSelectRC is given in Figure 3(d).

Since the retrospective correction can be deployed alongside any adaptation scheme, we modify the general adaptation scheme (Figure 2(a)) accordingly, resulting in Figure 2(b), where Figure 3(d) fits in the box “Correction”. Notice that when using this approach, the prediction function, which is used to generate predictions, can be different from the propagated function which is used as input for adaptation.
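As an illustration, a one-step retrospective correction under the same assumed AM interface as in the earlier sketches might be implemented as:

```python
import copy

def retrospective_correct(prev_model, ams, X_prev, Y_prev, X_now, Y_now, error):
    # Once Y_now is observed, re-evaluate which AM *should* have been deployed
    # on the previous batch (Eq. 6), and rebuild the propagated model from it.
    best = min(ams, key=lambda g: error(
        Y_now, g(copy.deepcopy(prev_model), X_prev, Y_prev).predict(X_now)))
    return best(copy.deepcopy(prev_model), X_prev, Y_prev)  # corrected model
```

The corrected model is then used as the input for the next adaptation step, while the predictions on the already-seen batch remain those of the uncorrected a priori function.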

We next examine the prediction algorithms, with their respective adaptive mechanisms (the sets $G$), used in this research.

4 Algorithms

For our experiments we have chosen SABLE (Bakirov et al., 2015) to address the regression problem, and have developed batch versions of the Dynamic Weighted Majority (DWM) (Kolter and Maloof, 2007) as well as the Paired Learner (PL) (Bach and Maloof, 2008) to address the classification problem. These three algorithms allow us to explore different types of online learning methods and different adaptive mechanisms. The use of the latter two algorithms also demonstrates that the adaptation strategies described in this paper are in fact generic and can be applied to various adaptive algorithms with multiple AMs. Below the details of each algorithm are presented. (SABLE was previously described in Bakirov et al. (2015), Sections 4, 5, 6. To make this work self contained, we repeat the description of the algorithm in this section.)

4.1 Simple Adaptive Batch Local Ensemble

SABLE is an extension of the ILLSA algorithm described in Kadlec and Gabrys (2011). ILLSA uses an ensemble of models, called base learners, with each base learner implemented as a linear model formed through Recursive Partial Least Squares (RPLS) (Joe Qin, 1998). To get the final prediction, the predictions of the base learners are combined using input/output space dependent weights (i.e. local learning). SABLE differs from ILLSA in that it is designed for batches of data, whereas ILLSA works and adapts on the basis of individual data points. Furthermore, SABLE supports the creation and merger of base learners. RPLS was chosen as a base learner because it is widely used for predictions in chemical processes, where high dimensional datasets tend to have low-dimensional embeddings. Furthermore, RPLS can be updated without requiring the historical data, and the merging of two models can be easily realised. Figure 4 shows the diagram of the SABLE model.

Figure 4: Block diagram of the SABLE model.

The relative (to each other) performance of the experts varies in different parts of the input/output space. In order to quantify this, a descriptor is used. Descriptors of experts are distributions of their weights, with the aim of describing the area of expertise of the particular local expert. They describe the mappings from a particular input, $x_i$, and output, $y$, to a weight, denoted $\Phi_{i,j}(x_i, y)$, where $i$ indexes the input feature (for base methods which transform the input space, such as PLS, the transformed input arguments are used instead of the original ones) and $j$ indexes the expert. The descriptor is constructed using a two-dimensional Parzen window method (Parzen, 1962) as:

$\Phi_{i,j}(x_i, y) = \sum_{l=1}^{n_j} w_l\, \Phi_G\big([x_{i,l}, y_l]^T, \Sigma\big)$   (8)

where $X_j$ is the training data used for the $j$-th expert, $n_j$ is the number of instances it includes, $w_l$ is the weight of the sample point's contribution (defined below), $x_{i,l}$ is the $l$-th sample of $X_j$, and $\Phi_G$ is a two-dimensional Gaussian kernel function with mean value $[x_{i,l}, y_l]^T$ and variance matrix $\Sigma$ with the kernel width, $\sigma$, at the diagonal positions. $\sigma$ is unknown and must be estimated as a hyperparameter of the overall algorithm. (In this research the inputs are first divided by their standard deviation, allowing us to assume an isotropic kernel for simplicity and also to reduce the number of parameters to be estimated.)

The weights for the construction of the descriptors (see Eq. 8) are inversely proportional to the prediction error of the respective local expert:

$w_l = 1 \,/\, \big| y_l - \hat{y}^{(j)}_l \big|$   (9)

Finally, considering that there are $p$ input variables and $m$ experts, the descriptors may be represented by a $p \times m$ matrix, called the descriptor matrix.

During the run-time phase, SABLE must make a prediction of the target variable given a batch of new data samples. This is done using the set of trained local experts and their descriptors. Each expert $f_j$ makes a prediction $\hat{y}^{(j)}$ for a data instance $x$. The final prediction is the weighted sum of the local experts' predictions:

$\hat{y} = \sum_{j=1}^{m} w_j\, \hat{y}^{(j)}$   (10)

where $w_j$ is the weight of the $j$-th local expert's prediction. The weights are calculated using the descriptors, which estimate the performance of the experts in the different regions of the input space. This can be expressed as the posterior probability of the $j$-th expert given the test sample $x$ and the local expert prediction $\hat{y}^{(j)}$:

$w_j = P\big(f_j \mid x, \hat{y}^{(j)}\big) = \frac{1}{Z}\, P(f_j)\, p\big(x, \hat{y}^{(j)} \mid f_j\big)$   (11)

where $P(f_j)$ is the a priori probability of the $j$-th expert (equal for all local experts in our implementation; different values could be used for expert prioritization), $Z$ is a normalisation factor and $p(x, \hat{y}^{(j)} \mid f_j)$ is the likelihood of $x, \hat{y}^{(j)}$ given the expert, which can be calculated by reading the descriptors at the positions defined by the sample $x$ and the prediction $\hat{y}^{(j)}$:

$p\big(x, \hat{y}^{(j)} \mid f_j\big) = \frac{1}{p} \sum_{i=1}^{p} \Phi_{i,j}\big(x_i, \hat{y}^{(j)}\big)$   (12)

Eq. 12 shows that the descriptors are sampled at positions which are given on the one hand by the scalar value of the $i$-th feature of the sample point, and on the other hand by the predicted output of the local expert corresponding to the $j$-th receptive field. Sampling the descriptors at the positions of the predicted outputs may result in a different outcome than sampling at the positions of the correct target values, because the predictions are not necessarily similar to the correct values. However, the correct target values are not available at the time of the prediction. The rationale for this approach is that a local expert is likely to be more accurate if it generates a prediction which conforms with an area occupied by a large number of true values during the training phase. To reduce the number of redundant experts, after the processing of batch $X_k$, some of those that deliver similar predictions on $X_k$ are removed and their descriptors merged. This process is implemented as follows. The prediction vectors of each expert on batch $X_k$ are obtained. The similarities between prediction vectors are pairwise tested using Student's t-test (Student, 1908). (The t-test assumes a normal distribution of the prediction vectors, which may not always be the case. In such cases, non-parametric tests which relax this assumption, for example the Mann-Whitney U test (Mann and Whitney, 1947), may be considered.) The $p$-values of the t-test results between each expert pair's prediction vectors are then calculated. Pruning is conducted if $p_{max} > \alpha$, where $p_{max}$ is the maximum of these $p$-values and $\alpha$ is the significance threshold, chosen as 0.05. During the pruning, the older of the two experts whose $p$-value is $p_{max}$ is removed, while their descriptors are added together to create a merged descriptor. This process is repeated until $p_{max} \le \alpha$.
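A sketch of this pruning loop (ours; the use of scipy's two-sample t-test and the age bookkeeping are assumptions, and descriptor merging is omitted):

```python
from itertools import combinations
from scipy import stats

def prune_similar_experts(pred_vectors, ages, alpha=0.05):
    # While the largest pairwise t-test p-value exceeds alpha, the two experts
    # are deemed to predict similarly and the older of the pair is removed.
    experts = list(range(len(pred_vectors)))
    while len(experts) > 1:
        p_max, pair = max(
            ((stats.ttest_ind(pred_vectors[a], pred_vectors[b]).pvalue, (a, b))
             for a, b in combinations(experts, 2)),
            key=lambda t: t[0])
        if p_max <= alpha:
            break
        a, b = pair
        experts.remove(a if ages[a] > ages[b] else b)  # drop the older expert
    return experts  # indices of the surviving experts
```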

The SABLE algorithm allows the use of four different adaptive mechanisms. AMs are deployed as soon as the true values for the batch are available and before predicting on the next batch. It is also possible that none of them are deployed. The SABLE AMs are described below. It should be noted that, as SABLE was conceived as a vehicle for exploring the effects of AM sequences, it does not provide a default custom adaptation strategy.

SAM0. No adaptation. No changes are applied to the predictive model, corresponding to $g^{(0)}$. This AM will be denoted as SAM0.

SAM1. Batch Learning. The simplest AM augments the existing data with the data from the new batch and retrains the model. Given the predictions of each expert on $X_k$, and the measurements of the actual values, $Y_k$, the batch $X_k$ is partitioned into subsets $X^{(j)}_k$ in the following fashion:

$x \in X^{(j)}_k \iff j = \arg\min_{j'} \big| y - \hat{y}^{(j')} \big|$   (13)

for every instance $x \in X_k$ with true value $y$. This creates subsets such that $\bigcup_j X^{(j)}_k = X_k$. Then each expert $f_j$ is updated using the respective dataset $X^{(j)}_k$. This process updates experts only with the instances on which they achieve the most accurate predictions, thus encouraging the specialisation of experts and ensuring that a single data instance is not used in the training data of multiple experts. This AM will be denoted as SAM1 in the description of the experiments below.
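For illustration, a sketch of this error-based partitioning (our code; array shapes and names are assumptions):

```python
import numpy as np

def partition_by_best_expert(preds, y):
    # Assign each instance to the expert with the smallest absolute error on it,
    # so experts specialise and no instance is shared between experts.
    # `preds` has shape (n_experts, n_instances); `y` has shape (n_instances,).
    best = np.argmin(np.abs(np.asarray(preds) - np.asarray(y)), axis=0)
    return {j: np.flatnonzero(best == j) for j in range(len(preds))}

# Two experts, three instances: expert 1 wins instances 0 and 2, expert 0 wins 1.
idx = partition_by_best_expert([[1.0, 2.0, 3.0], [0.5, 2.5, 2.9]], [0.6, 2.1, 2.9])
```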

SAM2. Batch Learning With Forgetting. This AM is similar to the one above, but uses decay which reduces the weight of the experts' historical training data, making the most recent data more important. It is realised via the RPLS update with a forgetting factor $\lambda$, which is a hyperparameter of SABLE.

SAM3. Descriptors update. This AM recalculates the local descriptors using the new batch, creating a new descriptor set $\Phi^{new}$. These are merged with the previous descriptor set, $\Phi^{old}$, in the following fashion:

$\Phi_{i,j} = \omega_{old}\, \Phi^{old}_{i,j} + \omega_{new}\, \Phi^{new}_{i,j}$   (14)

for all experts $j$ and features $i$, where $\omega_{old}$ and $\omega_{new}$ are the respective update weights associated with the old and new descriptors, and $\omega_{old} + \omega_{new} = 1$. This means that when $\omega_{old} = 0$, the descriptors update is essentially a recalculation using the most recent batch. The descriptor update weights are hyperparameters of SABLE. This AM will be denoted as SAM3.
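As a one-line illustration of Eq. 14 (our sketch; the descriptor matrices are assumed to be numpy arrays):

```python
import numpy as np

def merge_descriptors(phi_old, phi_new, w_old, w_new):
    # Weighted element-wise combination of old and new descriptor matrices;
    # with w_old = 0 this reduces to recalculation from the most recent batch.
    return w_old * np.asarray(phi_old) + w_new * np.asarray(phi_new)
```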

SAM4. Creation of New Experts. A new expert is created from the batch $X_k, Y_k$. It is then checked whether any of the experts in the resulting pool can be pruned as described earlier. Finally, the descriptors of all resulting experts are updated. This AM will be denoted as SAM4.

4.2 Batch Dynamic Weighted Majority

Batch Dynamic Weighted Majority (bDWM) is an extension of DWM designed to operate on batches of data instead of on single instances as in the original algorithm. bDWM is a global experts ensemble. Assume a set of experts $f_1, \dots, f_m$ which produce predictions $\hat{y}^{(j)}$ with input $x$, and a set of all possible labels $\Omega = \{\omega_1, \dots, \omega_C\}$. Then for all experts $j$ and labels $\omega_c$ the matrix with the following elements can be calculated:

$d_{j,c} = 1$ if $\hat{y}^{(j)} = \omega_c$, and $d_{j,c} = 0$ otherwise   (15)

Assuming a weights vector $(w_1, \dots, w_m)$ for the respective predictors, the sum of the weights of the predictors which voted for label $\omega_c$ is $\mu_c = \sum_{j=1}^{m} w_j d_{j,c}$. The final prediction (this definition is adapted from Kuncheva (2004)) is:

$\hat{y} = \arg\max_{\omega_c \in \Omega} \mu_c = \arg\max_{\omega_c \in \Omega} \sum_{j=1}^{m} w_j d_{j,c}$   (16)
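A small sketch of this weighted vote (ours; the names are illustrative):

```python
def weighted_vote(expert_preds, weights, labels):
    # Sum the weights of the experts voting for each label (Eq. 15-16)
    # and return the label with the largest weighted support.
    support = {c: 0.0 for c in labels}
    for pred, w in zip(expert_preds, weights):
        support[pred] += w
    return max(support, key=support.get)

print(weighted_vote(["a", "b", "a"], [0.5, 0.9, 0.3], ["a", "b"]))
# support: a -> 0.8, b -> 0.9, so "b" is predicted
```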

bDWM starts with a single expert (in our implementation a Naive Bayes classifier was used as the base learner) and can be adapted using an arbitrary sequence of the AMs (batch learning, weights update and expert creation) given below.

DAM0. No adaptation. No changes are applied to the predictive model, corresponding to $g^{(0)}$. This AM will be denoted as DAM0.

DAM1. Batch Learning. After the arrival of the batch $X_k$, each expert is updated with it. This AM will be denoted as DAM1.

DAM2. Weights Update and Experts Pruning. Weights of the experts are updated using the following rule:

$w_{j,k+1} = w_{j,k}\, a_{j,k}$   (17)

where $w_{j,k}$ is the weight of the $j$-th expert at batch $k$, and $a_{j,k}$ is its accuracy on the batch $X_k$. The weights of all experts in the ensemble are then normalized, and the experts with a weight less than a defined threshold are removed. It should be noted that the choice of the accuracy-based factor is inspired by Herbster and Warmuth (1998), although due to different algorithm settings, the theory developed there is not readily applicable to our scenario. This weights update is different from the original DWM, which uses an arbitrary factor to decrease the weights of misclassifying experts. This AM will be denoted as DAM2.
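A sketch of this update under our reading of Eq. 17 (the multiplicative accuracy factor and the threshold value are assumptions):

```python
import numpy as np

def dam2_update(weights, accuracies, threshold=0.01):
    # Scale each expert's weight by its accuracy on the batch, normalise,
    # then prune experts whose normalised weight falls below the threshold.
    w = np.asarray(weights, dtype=float) * np.asarray(accuracies, dtype=float)
    w = w / w.sum()      # assumes at least one expert has non-zero accuracy
    keep = w >= threshold
    return w[keep], keep  # surviving weights and a boolean mask of kept experts
```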

DAM3. Creation of a New Expert. A new expert is created from the batch $X_k, Y_k$ and is given a weight of 1. This AM will be denoted as DAM3.

Figure 5: bDWM custom adaptation strategy.

bDWM Custom Adaptive Strategy. Having described the separate adaptive mechanisms, we now give the custom adaptation strategy for bDWM, adapted for the batch setting from the original DWM. It starts with a single expert with a weight of one. After the arrival of a new batch $X_k$, the experts make predictions and the overall prediction is calculated as shown earlier in this section. After the arrival of the true labels, all experts learn on the batch (invoking DAM1) and update their weights (DAM2), and the ensemble's accuracy is calculated. If this accuracy is less than the accuracy of the naive majority classifier (based on all the batches of data seen up to this point) on the last batch, a new expert is created (DAM3). The schematic of this strategy is shown in Figure 5. This scheme fits in the “Adaptation” boxes in Figures 2(a) and 2(b).

Figure 6: bPL custom adaptation strategy.

4.3 Batch Paired Learner

Batch Paired Learner (bPL) is an extension of PL designed to operate on batches of data instead of on single instances as in the original algorithm. bPL maintains two learners: a stable learner, which is updated with all of the incoming data and which is used to make predictions, and a reactive learner, which is trained only on the most recent batch. Our implementation uses a Naive Bayes base learner. For this method, two adaptive mechanisms are available, which are described below.

PAM1. Updating the Stable Learner. After the arrival of the batch $X_k$, the stable learner is updated with it. This AM will be denoted as PAM1.

PAM2. Switching to the Reactive Learner. The current stable learner is discarded and replaced by the reactive learner. This AM will be denoted as PAM2.

bPL Custom Adaptive Strategy. Having described the separate adaptive mechanisms, we now give the custom adaptation strategy for bPL, adapted for the batch setting from the original PL. This strategy revolves around comparing the accuracy values of the stable and reactive learners on each batch of data. Every time the reactive learner is more accurate than the stable learner on a batch, a change counter is incremented. If the counter exceeds a defined threshold $\theta$, the existing stable learner is discarded and replaced by the reactive learner, while the counter is set to 0. As before, a new reactive learner is trained from each subsequent batch. The schematic of this strategy is shown in Figure 6. This scheme fits in the “Adaptation” boxes in Figures 2(a) and 2(b).
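A compact sketch of this strategy (ours; `make_learner` is an assumed factory returning a classifier with sklearn-style fit/score methods; an incremental partial_fit would be more faithful for the stable learner):

```python
class BatchPairedLearner:
    def __init__(self, make_learner, theta=3):
        self.make_learner, self.theta = make_learner, theta
        self.stable, self.reactive, self.counter = make_learner(), None, 0

    def process_batch(self, X, Y):
        # Compare the two learners on the incoming batch before learning from it.
        if self.reactive is not None and \
                self.reactive.score(X, Y) > self.stable.score(X, Y):
            self.counter += 1
        if self.counter > self.theta:
            self.stable, self.counter = self.reactive, 0   # PAM2: switch
        else:
            self.stable.fit(X, Y)                          # PAM1: update stable
        self.reactive = self.make_learner().fit(X, Y)      # retrain reactive
```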

5 Experimental results

The goal of the experiments given in this section was the empirical comparison of the generic adaptation strategies proposed in Section 3.2 with strategies involving the repeated deployment of one or all available AMs, and with the custom adaptation strategies. This section discusses the results in the order of the introduced algorithms. The experimentation was performed using 10 real world datasets (5 for regression and 5 for classification) and 26 synthetic datasets (for classification). Brief descriptions and characteristics of each of them are provided in Tables 1 (real world regression datasets), 3 (real world classification datasets) and 2 (synthetic classification datasets). The synthetic data is visualised in Figure 7. (Code for bDWM and bPL, as well as all the datasets except Oxidizer and Drier, can be found at https://github.com/RashidBakirov/multiple-adaptive-mechanisms. The SABLE code and the specified two datasets could not be shared because of a confidentiality agreement with the project partner.)

# Name N p Description
1 Catalyst activation 5867 12 Highly volatile simulation (real conditions based) of catalyst activation in a multi-tube reactor. Task is the prediction of catalyst activity while inputs are flow, concentration and temperature measurements (Strackeljan, 2006).
2 Thermal oxidiser 2820 36 Prediction of exhaust gas concentration during an industrial process, moderately volatile. Input features include concentration, flow, pressure and temperatures measurements (Kadlec and Gabrys, 2009).
3 Industrial drier 1219 16 Prediction of residual humidity of the process product, relatively stable. Input features include temperature, pressure and humidity measurements (Kadlec and Gabrys, 2009).
4 Debutaniser column 2394 7 Prediction of butane concentration at the output of the column. Input features are temperature, pressure and flow measurements (Fortuna et al., 2005).
5 Sulfur recovery 10081 6 Prediction of gas concentrations in the output of a sulfur recovery unit. Input features are gas and air flow measurements (Fortuna et al., 2003).
Table 1: Regression datasets. N stands for the number of instances and p for the number of features.
# Data type N C Drift Noise/overlap
1 Hyperplane 600 2 2x50% rotation None
2 Hyperplane 600 2 2x50% rotation 10% uniform noise
3 Hyperplane 600 2 9x11.11% rotation None
4 Hyperplane 600 2 9x11.11% rotation 10% uniform noise
5 Hyperplane 640 2 15x6.67% rotation None
6 Hyperplane 640 2 15x6.67% rotation 10% uniform noise
7 Hyperplane 1500 4 2x50% rotation None
8 Hyperplane 1500 4 2x50% rotation 10% uniform noise
9 Gaussian 1155 2 4x50% switching 0-50% overlap
10 Gaussian 1155 2 10x20% switching 0-50% overlap
11 Gaussian 1155 2 20x10% switching 0-50% overlap
12 Gaussian 2805 2 4x49.87% passing 0.21-49.97% overlap
13 Gaussian 2805 2 6x27.34% passing 0.21-49.97% overlap
14 Gaussian 2805 2 32x9.87% passing 0.21-49.97% overlap
15 Gaussian 945 2 4x52.05% move 0.04% overlap
16 Gaussian 945 2 4x52.05% move 10.39% overlap
17 Gaussian 945 2 8x27.63% move 0.04% overlap
18 Gaussian 945 2 8x27.63% move 10.39% overlap
19 Gaussian 945 2 20x11.25% move 0.04% overlap
20 Gaussian 945 2 20x11.25% move 10.39% overlap
21 Gaussian 1890 4 4x52.05% move 0.013% overlap
22 Gaussian 1890 4 4x52.05% move 10.24% overlap
23 Gaussian 1890 4 8x27.63% move 0.013% overlap
24 Gaussian 1890 4 8x27.63% move 10.24% overlap
25 Gaussian 1890 4 20x11.25% move 0.013% overlap
26 Gaussian 1890 4 20x11.25% move 10.24% overlap
Table 2: Synthetic classification datasets used in the experiments, from Bakirov and Gabrys (2013). The column “Drift” specifies the number of drifts/changes in the data, the percentage of change in the decision boundary and its type. N stands for the number of instances and C for the number of classes. All datasets have 2 input features.
# Name N p C Brief description
27 Australian electricity prices (Elec2) 27887 6 2 Widely used concept drift benchmark dataset thought to have seasonal and other changes as well as noise. Task is the prediction of whether electricity price rises or falls while inputs are days of the week, times of the day and electricity demands (Harries, 1999).
28 Power Italy 4489 2 4 The task is prediction of hour of the day (03:00, 10:00, 17:00 and 21:00) based on supplied and transferred power measured in Italy (Zhu, 2010; Chen et al., 2015).
29 Contraceptive 4419 9 3 Contraceptive dataset from UCI repository (Newman et al., 1998) with artificially added drift (Minku et al., 2010).
30 Iris 450 4 4 Iris dataset (Anderson, 1936; Fisher, 1936) with artificially added drift (Minku et al., 2010).
31 Yeast 5928 8 10 Yeast dataset from UCI repository (Newman et al., 1998) with artificially added drift (Minku et al., 2010).
Table 3: Real world classification datasets. N stands for the number of instances, p for the number of features and C for the number of classes.
Figure 7: Synthetic datasets visualisation (Bakirov and Gabrys, 2013).

5.1 SABLE

5.1.1 SABLE methodology

We use the adaptation strategies presented in Table 4 for our experimental comparison. The tests were run on 5 real world datasets from the process industry. It has been shown, e.g. in Salvador et al. (2016) and Bakirov et al. (2017a), that these datasets present different levels of volatility and noise. Brief characteristics of the datasets are given in Table 1; for more detailed descriptions the reader is referred to the publications above. Three different batch sizes for each dataset are examined in the simulations, together with a mix of typical parameter settings, as tabulated in Table 5. These parameter combinations were identified empirically.

Strategy Description
Sequence0 Deploy SAM0 on every batch. This means that only the first batch of data is used to create an expert.
Sequence1 Deploy SAM1 on every batch.
Sequence2 Deploy SAM2 on every batch.
Sequence3 Deploy SAM3 on every batch.
Sequence4 Deploy SAM4 on every batch.
Joint Deploy SAM2 and SAM4 (in this order) on every batch. This strategy deploys all of the available adaptive mechanisms (batch learning, addition of new experts and change of weights).
XVSelect Select AM based on the current data batch using the cross-validatory approach described in the Section 3.2.
XVSelectRC Select AM based on the current data batch using the cross-validatory approach using retrospective correction as described in the Section 3.2.

Table 4: SABLE Adaptive strategies
Dataset n Number of batches (ω_old, ω_new) λ σ LV
Catalyst 50 117 0, 1 0.5 1 12
Catalyst 100 59 0, 1 0.25 1 12
Catalyst 200 30 0, 1 0.5 1 12
Oxidizer 50 47 0.25, 0.75 0.5 1 3
Oxidizer 100 29 0, 1 0.25 0.01 3
Oxidizer 200 15 0, 1 0.25 0.01 3
Drier 50 25 0, 1 0.25 0.01 16
Drier 100 13 0, 1 0.5 0.1 16
Drier 200 7 0, 1 0.25 0.01 16
Debutaniser 50 47 0.25, 0.75 0.5 6
Debutaniser 100 23 0.25, 0.75 0.25 6
Debutaniser 200 11 0, 1 0.5 1 6
Sulfur 50 201 0.25, 0.75 0.5 1 7
Sulfur 100 100 0, 1 0.5 0.1 7
Sulfur 200 0, 1 0.5 0.1 7
Table 5: SABLE parameters for the different datasets. Here, n is the batch size, (ω_old, ω_new) are the update weights of the descriptors, λ is the RPLS forgetting factor, σ is the kernel width for descriptor construction and LV is the number of RPLS latent variables.

5.1.2 SABLE Results

To analyse the usefulness of the proposed generic adaptation strategies, we compare the normalised MAE values between them and the most accurate single AM deployment strategy across the datasets (denoted as BestAM). (For the real datasets in this work we use prequential evaluation (Dawid, 1984), i.e. we apply the model on the incoming batch of data, calculate the error/accuracy, then use this batch to adapt the model and proceed with the next batch.) Note that this most accurate strategy varies from dataset to dataset and is usually not known in advance. The results of this comparison are given in Figure 8. These results suggest that most of the time XVSelect and XVSelectRC perform better than or comparably to BestAM. The exceptions are the drier dataset with a batch size of 100 and the sulfur dataset (all batch sizes). We relate this to the stability of these datasets. Indeed, the BestAM in all these cases is the slowly adapting Sequence1, without any forgetting of old information. The difference in batch sizes is important for some datasets. This can be related to the frequency of changes and whether they happen inside a batch, which can have a negative impact on XVSelect and XVSelectRC. Retrospective correction (XVSelectRC) can drastically improve the performance of XVSelect in some cases.

Figure 8: Normalised MAE values (lower is better) of SABLE XVSelect, XVSelectRC and the most accurate single AM strategy for the different batch sizes. See Table 1 for the dataset numbers.

5.2 bDWM

5.2.1 bDWM methodology

We use the adaptation strategies presented in Table 6 for our experimental comparison. These strategies consist of fixed deployment sequences of the AMs given in Section 4.2, the custom bDWM adaptation strategy and the proposed generic strategies. The tests were run on the synthetic classification datasets shown in Table 2, as well as on 5 real world datasets from various areas (Table 3). The datasets have different rates of change and noise. Two different batch sizes, 10 and 20, are examined for each dataset. The PRTools (Duin et al., 2007) implementation of Naive Bayes was used as the base learner.

Strategy Description
Sequence0 Deploy DAM0 on every batch. This means that only the first batch of data is used to create an expert.
Sequence1 Deploy DAM1 on every batch.
Sequence2 Deploy DAM2 on every batch.
Sequence21 Deploy DAM2 followed by DAM1 on every batch.
Sequence3 Deploy DAM3 on every batch.
Sequence13 Deploy DAM1 followed by DAM3 on every batch.
Sequence23 Deploy DAM2 followed by DAM3 on every batch.
Sequence213 Deploy DAM2 followed by DAM1 followed by DAM3 on every batch.
bDWM Deploy bDWM custom adaptation strategy.
XVSelect Select AM based on the current data batch using the cross-validatory approach described in the Section 3.2.
XVSelectRC Select AM based on the current data batch using the cross-validatory approach using retrospective correction as described in the Section 3.2.

Table 6: bDWM Adaptive strategies

Results on synthetic data. The results of the comparison are shown in Figure 9. (For the synthetic datasets in this work we generate an additional 100 test data instances for each single instance in the training data, using the same distribution. The predictive accuracy on a batch is then measured on the test data relevant to that batch. This test data is not used for training or adapting models.) The accuracy values of XVSelect and XVSelectRC are comparable to those of the custom adaptation strategy and the deployment of the AM which provides the most accurate results (BestAM). The few exceptions are datasets 3, 4, 5 and 6 for batch size 10, and 4 and 6 for batch size 20. These datasets are characterised by frequent gradual change, where the classes are replaced by each other. Thus XVSelect and XVSelectRC may select an AM based on potentially outdated data, resulting in a drop in their accuracy. This issue is exacerbated by the existence of noise (datasets 4 and 6). It is also interesting to note the relatively poor accuracy of the bDWM algorithm on datasets 21-26, where the proposed strategies almost always outperform it. XVSelect seems to benefit from a larger batch size, but generally, XVSelectRC is often more accurate and more stable.

Figure 9: Accuracy values (higher is better) of the bDWM custom adaptation strategy (denoted as bDWM), XVSelect, XVSelectRC and the most accurate single AM strategy for different batch sizes on synthetic data. See Table 2 for the dataset numbers.
Figure 10: Accuracy values (higher is better) of the bDWM custom adaptation strategy (denoted as bDWM), XVSelect, XVSelectRC and the most accurate single AM strategy for different batch sizes on real data. See Table 3 for the dataset numbers.

Results on real data. The comparison results are shown in Figure 10. Generally, for these datasets XVSelect and XVSelectRC have accuracy rates comparable to the bDWM custom algorithm as well as to BestAM. A slight exception is dataset 28 (PowerItaly) with a batch size of 10. XVSelectRC provides more stable accuracy than XVSelect in these cases as well.

5.3 bPL

5.3.1 bPL methodology

We use the adaptation strategies presented in Table 7 for our experimental comparison. These strategies consist of fixed deployment sequences of the AMs given in Section 4.3, the custom bPL adaptation strategy and the proposed generic strategies. It must be noted that bPL includes only two adaptive mechanisms, which cannot be used jointly, as the deployment of one precludes the deployment of the other. The tests were run on the same synthetic and real datasets as the bDWM experiments (Tables 2 and 3). Two different batch sizes, 10 and 20, are again examined for each dataset. For the bPL custom algorithm we have experimented with several values of the threshold $\theta$. The PRTools (Duin et al., 2007) implementation of Naive Bayes was used as the base learner.

Strategy Description
Sequence0 Deploy PAM0 on every batch. This means that only the first batch of data is used to create an expert.
Sequence1 Deploy PAM1 on every batch.
bPL Deploy bPL custom adaptation strategy.
XVSelect Select AM based on the current data batch using the cross-validatory approach described in the Section 3.2.
XVSelectRC Select AM based on the current data batch using the cross-validatory approach using retrospective correction as described in the Section 3.2.

Table 7: bPL Adaptive strategies

Results on synthetic data. The results of the comparison are shown in Figure 11. In this case as well, the accuracy values of XVSelect and XVSelectRC are comparable to those of the custom adaptation strategy and the deployment of the AM which provides the most accurate results (BestAM). This case is characterised by a relatively poor performance of BestAM. This is especially noticeable for datasets 7 and 8 (the hyperplane datasets with 4 classes). It is evident that a single AM alone is not able to handle them: PAM0 does not adapt fast enough to changes, and PAM1's training data of a single batch is not enough to build a meaningful classifier. The size of the batch does not appear to make a big difference for XVSelect and XVSelectRC; however, increasing it is helpful for BestAM. Generally, XVSelectRC is often more accurate and more stable than XVSelect. For these experiments the bPL threshold which delivered the best results among the tried settings was used.

Figure 11: Accuracy values (higher is better) of the bPL custom adaptation strategy (denoted as bPL), XVSelect, XVSelectRC and the most accurate single AM strategy for different batch sizes on synthetic data. See Table 2 for the dataset numbers.
Figure 12: Accuracy values (higher is better) of the bPL custom adaptation strategy (denoted as bPL), XVSelect, XVSelectRC and the most accurate single AM strategy for different batch sizes on real data. See Table 3 for the dataset numbers.

Results on real data. The comparison results on real data are shown in Figure 12. For dataset 31 (Yeast), and to some extent 29 (Contraceptive), the performance of XVSelect and XVSelectRC is noticeably worse than that of the bPL custom algorithm, while in the other cases the proposed strategies are more accurate. The limited batch sizes may not have been sufficient for XVSelect and XVSelectRC in the case of Yeast, which has 10 target classes. It can be seen that with the larger batch size the results improve. For these experiments as well, the bPL threshold which delivered the best results among the tried settings was used.

6 Discussion and Conclusions

The core aim of this paper was to explore the issue of automating the adaptation of predictive algorithms, which was found to be a rather overlooked direction in the otherwise popular area of automated machine learning. In our research, we have addressed this by utilising a simple, yet powerful adaptation framework which separates adaptation from prediction, defines adaptive mechanisms and adaptation strategies, and allows the use of retrospective model correction. This adaptation framework enables the development of generic adaptation strategies, which can be deployed on any set of adaptive mechanisms, thus facilitating the automation of the adaptation of predictive algorithms.

We have used several generic adaptation strategies, based on cross-validation on the current batch and on retrospectively reverting the model to the oracle state after obtaining the most recent batch of data. We postulate that recently seen data is likely to be more related to the incoming data; therefore these strategies tend to steer the adaptation of the predictive model towards better results on the most recent available data.

To confirm our assumptions, we have empirically investigated the merit of the generic adaptation strategies XVSelect and XVSelectRC. For this purpose we have conducted experiments on 10 real and 26 synthetic datasets, exhibiting various levels of need for adaptation.

The results are promising: for the majority of these datasets, the proposed generic approaches were able to demonstrate performance comparable to those of specifically designed custom algorithms and to the repeated deployment of any single adaptive mechanism. The exceptions were analysed. It is postulated that the reasons for the poor performance of XVSelect and XVSelectRC in these cases were a) a lack of change and hence of the need for adaptation, b) insufficient data in a batch, and c) gradual replacement of classes by each other. Issues (a) and (b) have trivial solutions, whereas (c) may require further improvement of the generic adaptation strategies.

A benefit of the proposed generic adaptation strategies is that they can help designers of machine learning solutions save time by not having to devise a custom adaptation strategy. XVSelect and XVSelectRC are essentially parameter-free, except for the number of cross validation folds, the choice of which is relatively trivial.

This research has focused on the batch scenario, which is natural for many use cases. Adapting the introduced generic adaptation strategies to the incremental learning scenario remains a future research question. In that case the lack of batches would, for example, pose the question of which data should be used for cross validation. This could be addressed using data windows of static or dynamically changing size. Another useful avenue of research is the semi-supervised scenario, where true values or labels are not always available. This is relevant for many applications, among them those in the process industry. An additional future research direction is the theoretical analysis of this line of research, where relevant expert/bandit strategies may be useful.

In general, there is a rising tendency towards modular systems for the construction of machine learning solutions, where adaptive mechanisms are considered as separate entities, along with pre-processing and predictive techniques. One of the features of such systems is easy, and often automated, plug-and-play construction of machine learning architectures. The generic adaptation strategies introduced in this paper further contribute towards this automation.

Acknowledgements.
The authors would like to thank Evonik Industries AG for the provided datasets. Part of the used Matlab code originates from P. Kadlec and R. Grbić.

References

  • Alcobé (2004) Alcobé JR (2004) Incremental Hill-Climbing Search Applied to Bayesian Network Structure Learning. In: Proceedings of the Eighth European Conference on Principles and Practice of Knowledge Discovery in Databases, Volume 3202 of Lecture Notes in Computer Science. Springer Verlag
  • Alippi et al. (2012) Alippi C, Boracchi G, Roveri M (2012) Just-in-time ensemble of classifiers. In: The 2012 International Joint Conference on Neural Networks (IJCNN), IEEE, pp 1–8, DOI 10.1109/IJCNN.2012.6252540
  • Anderson (1936) Anderson E (1936) The Species Problem in Iris. Annals of the Missouri Botanical Garden 23(3):457, DOI 10.2307/2394164
  • Ba and Frey (2013) Ba J, Frey B (2013) Adaptive dropout for training deep neural networks. In: NIPS’13 Proceedings of the 26th International Conference on Neural Information Processing Systems, pp 3084–3092
  • Bach and Maloof (2008) Bach SH, Maloof MA (2008) Paired Learners for Concept Drift. 2008 Eighth IEEE International Conference on Data Mining pp 23–32, DOI 10.1109/ICDM.2008.119
  • Bakirov and Gabrys (2013) Bakirov R, Gabrys B (2013) Investigation of Expert Addition Criteria for Dynamically Changing Online Ensemble Classifiers with Multiple Adaptive Mechanisms. In: Papadopoulos H, Andreou A, Iliadis L, Maglogiannis I (eds) Artificial Intelligence Applications and Innovations, pp 646–656

  • Bakirov et al. (2015) Bakirov R, Gabrys B, Fay D (2015) On sequences of different adaptive mechanisms in non-stationary regression problems. In: 2015 International Joint Conference on Neural Networks (IJCNN), pp 1–8, DOI 10.1109/IJCNN.2015.7280779
  • Bakirov et al. (2016) Bakirov R, Gabrys B, Fay D (2016) Augmenting adaptation with retrospective model correction for non-stationary regression problems. In: 2016 International Joint Conference on Neural Networks (IJCNN), IEEE, pp 771–779, DOI 10.1109/IJCNN.2016.7727278
  • Bakirov et al. (2017a) Bakirov R, Gabrys B, Fay D (2017a) Multiple adaptive mechanisms for data-driven soft sensors. Computers & Chemical Engineering 96:42–54, DOI 10.1016/j.compchemeng.2016.08.017
  • Bakirov et al. (2017b) Bakirov R, Gabrys B, Fay D (2017b) Multiple adaptive mechanisms for data-driven soft sensors. Computers and Chemical Engineering 96, DOI 10.1016/j.compchemeng.2016.08.017
  • Basak (2006) Basak J (2006) Online adaptive decision trees: Pattern classification and function approximation. Neural computation 18(9):2062–2101
  • Bifet et al. (2009) Bifet A, Holmes G, Gavaldà R, Pfahringer B, Kirkby R (2009) New Ensemble Methods For Evolving Data Streams. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’09 pp 139–147, DOI 10.1145/1557019.1557041
  • Carpenter et al. (1991) Carpenter G, Grossberg S, Reynolds J (1991) ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural networks 4:565–588
  • Castillo and Gama (2006) Castillo G, Gama J (2006) An Adaptive Prequential Learning Framework for Bayesian Network Classifiers. In: Fürnkranz J, Scheffer T, Spiliopoulou M (eds) Knowledge Discovery in Databases: PKDD 2006, Springer Berlin Heidelberg, Berlin, Heidelberg, Lecture Notes in Computer Science, vol 4213, pp 67–78, DOI 10.1007/11871637
  • Chen et al. (2015) Chen Y, Keogh E, Hu B, Begum N, Bagnall A, Mueen A, Batista G (2015) The UCR Time Series Classification Archive
  • Cinar et al. (2003) Cinar A, Parulekar SJ, Undey C, Birol G (2003) Batch Fermentation: Modeling: Monitoring, and Control. CRC Press
  • Dawid (1984) Dawid AP (1984) Present Position and Potential Developments: Some Personal Views: Statistical Theory: The Prequential Approach. Journal of the Royal Statistical Society Series A (General) 147(2):278, DOI 10.2307/2981683
  • Domingos and Hulten (2000) Domingos P, Hulten G (2000) Mining high-speed data streams. Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’00 pp 71–80, DOI 10.1145/347090.347107
  • Duin et al. (2007) Duin RPW, Juszczak P, Paclik P, Pekalska E, de Ridder D, Tax DMJ, Verzakov S (2007) PRTools4.1, A Matlab Toolbox for Pattern Recognition
  • Elwell and Polikar (2011) Elwell R, Polikar R (2011) Incremental learning of concept drift in nonstationary environments. IEEE Transactions on Neural Networks 22(10):1517–1531, DOI 10.1109/TNN.2011.2160459
  • Fern and Givan (2000) Fern A, Givan R (2000) Dynamic feature selection for hardware prediction. Tech. rep., Purdue University
  • Feurer et al. (2015) Feurer M, Klein A, Eggensperger K, Springenberg J, Blum M, Hutter F (2015) Efficient and Robust Automated Machine Learning. In: Advances in Neural Information Processing Systems 28 (NIPS 2015), pp 2962–2970
  • Fisher (1936) Fisher RA (1936) The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics 7(2):179–188, DOI 10.1111/j.1469-1809.1936.tb02137.x
  • Fortuna et al. (2003) Fortuna L, Rizzo A, Sinatra M, Xibilia M (2003) Soft analyzers for a sulfur recovery unit. Control Engineering Practice 11(12):1491–1500, DOI 10.1016/S0967-0661(03)00079-0
  • Fortuna et al. (2005) Fortuna L, Graziani S, Xibilia M (2005) Soft sensors for product quality monitoring in debutanizer distillation columns. Control Engineering Practice 13(4):499–508, DOI 10.1016/J.CONENGPRAC.2004.04.013
  • Friedman and Goldszmidt (1997) Friedman N, Goldszmidt M (1997) Sequential update of Bayesian network structure. In: Proceedings of the Thirteenth conference on Uncertainty in artificial intelligence, pp 165–174
  • Gomes Soares and Araújo (2015) Gomes Soares S, Araújo R (2015) An on-line weighted ensemble of regressor models to handle concept drifts. Engineering Applications of Artificial Intelligence 37:392–406, DOI 10.1016/j.engappai.2014.10.003
  • Harries (1999) Harries M (1999) Splice-2 comparative evaluation: Electricity pricing. Tech. rep., The University of New South Wales
  • Hazan and Seshadhri (2009) Hazan E, Seshadhri C (2009) Efficient learning algorithms for changing environments. In: ICML ’09 Proceedings of the 26th Annual International Conference on Machine Learning, pp 393–400
  • Herbster and Warmuth (1998) Herbster M, Warmuth M (1998) Tracking the best expert. Machine Learning 29:1–29
  • Hochreiter and Schmidhuber (1997) Hochreiter S, Schmidhuber J (1997) Long Short-Term Memory. Neural Computation 9(8):1735–1780, DOI 10.1162/neco.1997.9.8.1735
  • Hulten et al. (2001) Hulten G, Spencer L, Domingos P (2001) Mining time-changing data streams. In: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’01, ACM Press, New York, New York, USA, pp 97–106, DOI 10.1145/502512.502529
  • Hutter et al. (2011) Hutter F, Hoos HH, Leyton-Brown K (2011) Sequential Model-Based Optimization for General Algorithm Configuration. In: LION’05 Proceedings of the 5th international conference on Learning and Intelligent Optimization, Springer, Berlin, Heidelberg, pp 507–523, DOI 10.1007/978-3-642-25566-3_40
  • Ikonomovska et al. (2010) Ikonomovska E, Gama J, Džeroski S (2010) Learning model trees from evolving data streams. Data Mining and Knowledge Discovery 23(1):128–168, DOI 10.1007/s10618-010-0201-y
  • Jang et al. (1997) Jang JSR, Sun CT, Mizutani E (1997) Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence. Prentice Hall
  • Qin (1998) Qin SJ (1998) Recursive PLS algorithms for adaptive data modeling. Computers & Chemical Engineering 22(4-5):503–514
  • Kadlec and Gabrys (2009) Kadlec P, Gabrys B (2009) Architecture for development of adaptive on-line prediction models. Memetic Computing 1(4):241–269, DOI 10.1007/s12293-009-0017-8
  • Kadlec and Gabrys (2010) Kadlec P, Gabrys B (2010) Adaptive on-line prediction soft sensing without historical data. In: The 2010 International Joint Conference on Neural Networks (IJCNN), IEEE, pp 1–8, DOI 10.1109/IJCNN.2010.5596965
  • Kadlec and Gabrys (2011) Kadlec P, Gabrys B (2011) Local learning-based adaptive soft sensor for catalyst activation prediction. AIChE Journal 57(5):1288–1301, DOI 10.1002/aic.12346
  • Klinkenberg (2004) Klinkenberg R (2004) Learning drifting concepts: Example selection vs. example weighting. Intelligent Data Analysis 8(3):281–300
  • Klinkenberg and Joachims (2000) Klinkenberg R, Joachims T (2000) Detecting concept drift with support vector machines. In: Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pp 487–494
  • Kolter and Maloof (2007) Kolter JZ, Maloof MA (2007) Dynamic weighted majority: An ensemble method for drifting concepts. Journal of Machine Learning Research 8:2755–2790
  • Kotthoff et al. (2017) Kotthoff L, Thornton C, Hoos HH, Hutter F, Leyton-Brown K (2017) Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. Journal of Machine Learning Research 18(25):1–5
  • Kuncheva (2004) Kuncheva LI (2004) Combining Pattern Classifiers: Methods and Algorithms. Wiley-Blackwell
  • Lam (1998) Lam W (1998) Bayesian network refinement via machine learning approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3):240–251
  • Lin (1992) Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8(3-4):293–321, DOI 10.1007/BF00992699
  • Littlestone and Warmuth (1994) Littlestone N, Warmuth M (1994) The Weighted Majority Algorithm. Information and Computation 108(2):212–261
  • Lloyd et al. (2014) Lloyd JR, Duvenaud D, Grosse R, Tenenbaum JB, Ghahramani Z (2014) Automatic construction and natural-language description of nonparametric regression models. In: Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, AAAI Press, pp 1242–1250
  • Mann and Whitney (1947) Mann HB, Whitney DR (1947) On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. The Annals of Mathematical Statistics 18(1):50–60, DOI 10.1214/aoms/1177730491
  • Minku et al. (2010) Minku L, White A, Yao X (2010) The Impact of Diversity on Online Ensemble Learning in the Presence of Concept Drift. IEEE Transactions on Knowledge and Data Engineering 22(5):730–742, DOI 10.1109/TKDE.2009.156
  • Nath (2007) Nath SV (2007) Champion-challenger based predictive model selection. In: Proceedings 2007 IEEE SoutheastCon, IEEE, pp 254–254
  • Newman et al. (1998) Newman D, Hettich S, Blake C, Merz C (1998) UCI repository of machine learning databases
  • Oza and Russell (2001) Oza NC, Russell S (2001) Online bagging and boosting. In: Artificial Intelligence and Statistics 2001, pp 105–112
  • Parzen (1962) Parzen E (1962) On Estimation of a Probability Density Function and Mode. The Annals of Mathematical Statistics 33(3):1065–1076
  • Potts and Sammut (2005) Potts D, Sammut C (2005) Incremental Learning of Linear Model Trees. Machine Learning 61(1-3):5–48, DOI 10.1007/s10994-005-1121-8
  • Salganicoff (1993a) Salganicoff M (1993a) Density-Adaptive Learning and Forgetting. Tech. rep. IRCS-93-50, University of Pennsylvania, Institute for Research in Cognitive Science
  • Salganicoff (1993b) Salganicoff M (1993b) Explicit Forgetting Algorithms for Memory Based Learning. Tech. rep. IRCS-93-49, University of Pennsylvania, Institute for Research in Cognitive Science
  • Salvador et al. (2016) Salvador MM, Budka M, Gabrys B (2016) Automatic composition and optimisation of multicomponent predictive systems. arXiv preprint arXiv:1612.08789
  • Schlimmer and Granger (1986) Schlimmer JC, Granger RH (1986) Beyond incremental processing: Tracking Concept Drift. AAAI-86 Proceedings pp 502–507
  • Schmidt and Lipson (2007) Schmidt M, Lipson H (2007) Learning noise. In: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation - GECCO '07, pp 1680–1685, DOI 10.1145/1276958.1277289
  • Scholz and Klinkenberg (2007) Scholz M, Klinkenberg R (2007) Boosting Classifiers for Drifting Concepts. Intelligent Data Analysis 11(1):1–40
  • Souza and Araújo (2014) Souza F, Araújo R (2014) Online Mixture of Univariate Linear Regression Models for Adaptive Soft Sensors. IEEE Transactions on Industrial Informatics 10:937–945, DOI 10.1109/TII.2013.2283147
  • Stanley (2002) Stanley KO (2002) Evolving neural networks through augmenting topologies. Evolutionary Computation 10(2):99–127
  • Strackeljan (2006) Strackeljan J (2006) NiSIS Competition 2006 - Soft Sensor for the Adaptive Catalyst Monitoring of a Multi-Tube Reactor. Tech. rep., Universität Magdeburg
  • Street and Kim (2001) Street WN, Kim YS (2001) A streaming ensemble algorithm (SEA) for large-scale classification. Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining pp 377–382
  • Student (1908) Student (1908) The Probable Error of a Mean. Biometrika 6(1):1–25
  • Tsymbal et al. (2008) Tsymbal A, Pechenizkiy M, Cunningham P, Puuronen S (2008) Dynamic integration of classifiers for handling concept drift. Information Fusion 9(1):56–68, DOI 10.1016/j.inffus.2006.11.002
  • Vakil-Baghmisheh and Pavešić (2003) Vakil-Baghmisheh MT, Pavešić N (2003) A Fast Simplified Fuzzy ARTMAP Network. Neural Processing Letters 17(3):273–316, DOI 10.1023/A:1026004816362
  • Wang et al. (2003) Wang H, Fan W, Yu PS, Han J (2003) Mining concept-drifting data streams using ensemble classifiers. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’03, ACM Press, New York, New York, USA, pp 226–235, DOI 10.1145/956750.956778
  • Wasserman (2000) Wasserman L (2000) Bayesian Model Selection and Model Averaging. Journal of Mathematical Psychology 44(1):92–107, DOI 10.1006/JMPS.1999.1278
  • Werbos (1974) Werbos PJ (1974) Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University
  • Widmer and Kubat (1996) Widmer G, Kubat M (1996) Learning in the presence of concept drift and hidden contexts. Machine Learning 23(1):69–101, DOI 10.1007/BF00116900
  • Zhu (2010) Zhu X (2010) Stream Data Mining Repository, http://www.cse.fau.edu/~xqzhu/stream.html
  • Zliobaite (2011) Zliobaite I (2011) Combining Similarity in Time and Space for Training Set Formation under Concept Drift. Intelligent Data Analysis 15(4):589–611
  • Zliobaite and Kuncheva (2010) Zliobaite I, Kuncheva LI (2010) Theoretical Window Size for Classification in the Presence of Sudden Concept Drift. Tech. rep., CS-TR-001-2010, Bangor University, UK