Handling Concept Drift via Model Reuse

09/08/2018 ∙ by Peng Zhao, et al. ∙ Nanjing University

In many real-world applications, data are collected in the form of a stream, and thus the distribution usually changes over time, which is referred to as concept drift in the literature. We propose a novel and effective approach to handle concept drift via model reuse, leveraging previous knowledge by reusing models. Each model is associated with a weight representing its reusability towards the current data, and the weight is adaptively adjusted according to the model's performance. We provide generalization and regret analysis. Experimental results also validate the superiority of our approach on both synthetic and real-world datasets.

1 Introduction

With the rapid development of data collection technology, it is of great importance to analyze the collected data and extract knowledge from them. However, data commonly arrive as a stream and are usually collected from non-stationary environments, so they evolve over time. In other words, the joint distribution between the input features and the target label changes, which is referred to as concept drift in the literature [Gama et al., 2014]. If we simply ignore the distribution change when learning from an evolving data stream, performance drops dramatically, which is unacceptable both empirically and theoretically. Concept drift has become one of the most challenging issues in data stream learning, and it has gradually drawn researchers’ attention to designing effective and theoretically sound algorithms.

A data stream with concept drift is essentially impossible to learn (predict) without any assumption on the distribution change. That is, if the underlying distribution changes arbitrarily or even adversarially, there is no hope of learning a good model for prediction. We share the same assumption as most previous work, namely, that previous data contain some knowledge useful for future prediction. Whether sliding window based approaches [Klinkenberg and Joachims, 2000; Bifet and Gavaldà, 2007; Kuncheva and Zliobaite, 2009], forgetting based approaches [Koychev, 2000; Klinkenberg, 2004] or ensemble based approaches [Kolter and Maloof, 2005; Kolter and Maloof, 2007; Sun et al., 2018], they all share this assumption; the only difference is how they utilize previous knowledge or data.

Another issue is that most previous work on handling concept drift focuses on algorithm design, and only a few works consider the theoretical side [Helmbold and Long, 1994; Crammer et al., 2010; Mohri and Medina, 2012]. Some works propose algorithms along with theoretical analysis; for example, Kolter and Maloof [2005] provide mistake and loss bounds and guarantee that the performance of the proposed approach is relative to the performance of the base learner. Harel et al. [2014] detect concept drift via resampling and provide bounds based on stability analysis. However, few of them have clear theoretical guarantees or justifications of why and how to leverage previous knowledge to combat concept drift, especially from the generalization aspect.

In this paper, we propose a novel and effective approach for handling Concept drift via model reuse, or Condor. It consists of two modules: the ModelUpdate module leverages previous knowledge to help build the new model and update the model pool, while the WeightUpdate module adaptively assigns weights to previous models according to their performance, representing their reusability towards the current data. We justify the advantage of ModelUpdate from the aspect of generalization analysis, showing that our approach can benefit from a good weighted combination of previous models. Meanwhile, the WeightUpdate module guarantees that the weights concentrate on the better-fit models. Besides, we also provide a dynamic regret analysis. Empirical experiments on both synthetic and real-world datasets validate the effectiveness of our approach.

In the following, Section 2 discusses related work. Section 3 proposes our approach. Section 4 presents theoretical analysis. Section 5 reports the experimental results. Finally, we conclude the paper and discuss future work in Section 6.

2 Related Work

Concept Drift has been well recognized in recent research [Gama et al., 2014; Gomes et al., 2017]. Basically, if there is no structural information about the data stream, and the distribution can change arbitrarily or even adversarially, we cannot expect to learn from historical data and make any meaningful prediction. Thus, it is crucial to make assumptions about the concept drift stream. Typically, most previous work assumes that nearby data items contain more useful information w.r.t. the current data, and thus researchers have proposed plenty of approaches based on sliding window and forgetting mechanisms. Sliding window based approaches maintain the most recent data items and discard old ones, with a fixed or adaptive window size [Klinkenberg and Joachims, 2000; Kuncheva and Zliobaite, 2009]. Forgetting based approaches do not explicitly discard old items but downweight previous data items according to their age [Koychev, 2000; Klinkenberg, 2004]. Another important category is the ensemble based approaches, which can adaptively add or delete base classifiers and dynamically adjust weights when dealing with an evolving data stream. A series of work borrows the idea from boosting [Schapire, 1990] and online boosting [Beygelzimer et al., 2015] to dynamically adjust the weights of classifiers. To name a few representatives, dynamic weighted majority (DWM) dynamically creates and removes weighted experts in response to changes [Kolter and Maloof, 2003; Kolter and Maloof, 2007]. Additive expert ensemble (AddExp) maintains and dynamically adjusts an additive expert pool, and provides a theoretical guarantee with solid mistake and loss bounds [Kolter and Maloof, 2005]. Learning in non-stationary environments (Learn++.NSE) trains one new classifier for each batch of data it receives, and combines these classifiers [Elwell and Polikar, 2011]. There are plenty of approaches to learning or mining from evolving data streams; readers can refer to comprehensive surveys [Gama et al., 2014; Gomes et al., 2017]. As for boosting and ensemble approaches, readers are referred to the books [Schapire and Freund, 2012; Zhou, 2012].

Our approach is superficially similar to DWM and AddExp: we all maintain a model pool and adjust weights to penalize models with poor performance. However, we differ in the model update procedure, since they do not leverage previous knowledge or reuse models to help build the new model and update the model pool. Besides, our weight update strategies are also different.

Model Reuse is an important learning problem, also known as model transfer, hypothesis transfer learning, or learning from auxiliary classifiers. The basic setting is that one desires to reuse pre-trained models to help further model building, especially when the data are too scarce to directly train a fair model. A series of work builds on the idea of biased regularization, which uses previous models as a bias regularizer in empirical risk minimization and achieves good performance in plenty of scenarios [Duan et al., 2009; Tommasi et al., 2010, 2014]. There are also other attempts and applications, such as model reuse by random forests [Segev et al., 2017] and applying model reuse to adapt to different performance measures [Li et al., 2013]. Apart from algorithm design, theoretical foundations have recently been established via stability [Kuzborskij and Orabona, 2013], Rademacher complexity [Kuzborskij and Orabona, 2017] and transformation functions [Du et al., 2017].

Our paper proposes to handle the concept drift problem by means of model reuse learning. The idea of leveraging previous knowledge is reminiscent of previous work that copes with concept drift by model reuse (transfer), such as the temporal inductive transfer (TIX) approach [Forman, 2006] and the diversity and transfer-based ensemble learning (DTEL) approach [Sun et al., 2018]. Both are batch-style approaches, that is, they need to receive a batch of data each time, whereas ours can update in either an incremental or a batch mode. TIX concatenates the predictions of previous models to the features of the next data batch, and a new model is learned from the augmented batch. DTEL chooses the decision tree as the base learner and builds a new tree by “fine-tuning” previous models through direct adaptation of the tree structure. It maintains a fixed-size model pool with selection criteria based on a diversity measure. Neither approach characterizes the reusability of previous models, which is carried out by the WeightUpdate module in our approach. Last but not least, our approach comes with sound theoretical guarantees; in particular, we provide a generalization justification of why and how to reuse previous models, whereas theirs lack clear theoretical justification in general.

3 Proposed Approach

In this section, we first illustrate the basic idea, and then identify two important modules in designing our proposed approach, i.e., ModelUpdate and WeightUpdate.

Specifically, we adopt a drift detection algorithm to split the whole data stream into epochs in which the underlying distribution is relatively smooth, and we only update the model when concept drift is detected or when the maximum update period is reached. As shown in Figure 1, the drift detector monitors the stream for concept drift. When drift is detected, instead of resetting the model pool and incrementally training a new model, we aim to leverage the knowledge in previous models to enhance the overall performance by alleviating the cold start problem.

Basically, our approach consists of two important modules,

  1. ModelUpdate by model reuse: we leverage previous models to build the new model and update the model pool, by means of multiple model reuse with biased regularization.

  2. WeightUpdate by expert advice: we associate each previous model with a weight representing its reusability towards the current data. The weights are updated according to the performance of each model, in an exponentially weighted average manner.

Figure 1: Illustration of the main idea: on the one hand, we utilize the data items in the current epoch; on the other hand, we leverage previous knowledge (the previously learned models) via model reuse.

3.1 Model Update by Model Reuse

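We leverage previous models to adapt to the current data epoch via model reuse by biased regularization [Schölkopf et al., 2001; Tommasi et al., 2014].

Consider the $k$-th model update as illustrated in Figure 1. We wish to leverage the previous models $\{h_1, \ldots, h_{k-1}\}$ and the current data epoch $S_k$ to obtain a new model $h_k$. With a slight abuse of notation, we denote $S_k = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)\}$. In this paper, we adopt a linear classifier as the base model, and model reuse by biased regularization can be formulated as

$$\hat{\mathbf{w}}_k = \arg\min_{\mathbf{w}} \left\{ \frac{1}{m} \sum_{i=1}^{m} \ell(\langle \mathbf{w}, \mathbf{x}_i \rangle, y_i) + \mu\, \Omega(\mathbf{w} - \mathbf{w}_p) \right\}, \tag{1}$$

where $\ell$ is the loss function and $\Omega$ is the regularizer. Besides, $\mu$ is a positive trade-off regularization coefficient, and $\mathbf{w}_p$ is the linear weighted combination of previous models, namely, $\mathbf{w}_p = \sum_{j=1}^{k-1} \beta_j \mathbf{w}_j$, where $\beta_j$ is the weight associated with previous model $h_j$, representing the reusability of each model on the current data epoch.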

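For simplicity, in this paper, we choose the square loss with $\ell_2$ regularization in our practical implementation, which is essentially the Least Squares Support Vector Machine (LS-SVM) [Suykens et al., 2002]. It is shown [Suykens et al., 2002] that the optimal solution can be expressed as $\mathbf{w} = \mathbf{w}_p + \sum_{i=1}^{m} \alpha_i \mathbf{x}_i$, with $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_m)^\top$ solved by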

(2)

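where $\mathbf{K}$ is the linear kernel matrix, i.e., $\mathbf{K}_{ij} = \mathbf{x}_i^\top \mathbf{x}_j$. Besides, $\mathbf{y}$ and $\hat{\mathbf{y}}_j$ are the vectors containing the labels of the data stream and the predictions of the $j$-th previous model, that is, $\mathbf{y} = (y_1, \ldots, y_m)^\top$ and $\hat{\mathbf{y}}_j = (h_j(\mathbf{x}_1), \ldots, h_j(\mathbf{x}_m))^\top$.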
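To make the biased regularization idea concrete, the following is a minimal sketch rather than the implementation used in the paper. It assumes the square loss, a squared $\ell_2$ regularizer and no bias term, and solves the primal problem $\min_{\mathbf{w}} \frac{1}{m}\sum_i (\langle \mathbf{w}, \mathbf{x}_i\rangle - y_i)^2 + \mu \|\mathbf{w} - \mathbf{w}_p\|^2$ through its normal equations $(X^\top X / m + \mu I)\,\mathbf{w} = X^\top \mathbf{y} / m + \mu \mathbf{w}_p$, instead of the dual LS-SVM system. The names `biased_ridge`, `solve_linear_system`, `mu` and `w_p` are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick the row with the largest pivot to keep the elimination stable.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def biased_ridge(xs, ys, w_p, mu):
    """Least squares regularized towards w_p (assumed form, no bias term)."""
    m, d = len(xs), len(w_p)
    # Left-hand side: X^T X / m + mu * I.
    lhs = [[sum(x[i] * x[j] for x in xs) / m + (mu if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    # Right-hand side: X^T y / m + mu * w_p.
    rhs = [sum(x[i] * y for x, y in zip(xs, ys)) / m + mu * w_p[i]
           for i in range(d)]
    return solve_linear_system(lhs, rhs)
```

With a very large `mu` the solution stays close to the reused combination `w_p`, while a small `mu` lets the current epoch dominate, which is the trade-off the regularization coefficient controls.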

If concept drift occurs very frequently or the data stream accumulates for a long time, the model pool will grow without bound unless models are deleted. Thus, we cap the model pool at a fixed maximum size. One option is to keep the models with the largest diversity, as done in [Sun et al., 2018]. For simplicity, we only keep the newest ones in the model pool.
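The keep-the-newest policy can be sketched with a bounded queue. This is only an illustration: the pool size of 3, the string placeholders standing in for models, and the helper name `add_model` are assumptions made for the example.

```python
from collections import deque


def add_model(pool, model):
    """Append a model; a full deque silently drops its oldest entry.

    Returns uniform weights over the models now in the pool.
    """
    pool.append(model)
    return [1.0 / len(pool)] * len(pool)


pool = deque(maxlen=3)  # pool size 3 is an arbitrary illustrative value
weights = []
for name in ["h1", "h2", "h3", "h4", "h5"]:
    weights = add_model(pool, name)
# After five insertions only the three newest models remain in the pool.
```

A diversity-based selection rule would replace the `deque` with an explicit scoring step; the bounded queue is simply the cheapest way to express the newest-first rule.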

Remark 1.

The biased regularization model reuse learning (1) is not limited to the binary scenario, and can easily be extended to the multi-class scenario as,

(3)

where , and is the margin. We defer the notations and corresponding theoretical analyses in Section C. In addition, our approach is a framework, and can choose any multiple model reuse algorithm as the sub-routine. For instance, we can also choose model reuse by random forests [Segev et al., 2017].

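Input: data stream $\{(\mathbf{x}_t, y_t)\}_{t=1}^{T}$; drift detector with corresponding threshold $\delta$; step size $\eta$; maximum update period (epoch size) $p$; model pool size $K$.
Output: predictions $\hat{y}_t$, where $t = 1, \ldots, T$; and the returned model pool $\mathcal{H}$.
Initialize a model $h_1$ on the first (or a couple of) data items, and set $\mathcal{H} = \{h_1\}$;
Initialize the weight $\beta_{1,1} = 1$;
for $t = 1$ to $T$ do
    Receive $\mathbf{x}_t$;
    for $j = 1$ to $|\mathcal{H}|$ do
        $\hat{y}_{t,j} = h_j(\mathbf{x}_t)$;
    end for
    $\hat{y}_t = \sum_{j} \beta_{t,j}\, \hat{y}_{t,j}$;
    Receive $y_t$;
    for $j = 1$ to $|\mathcal{H}|$ do
        $\beta_{t+1,j} = \beta_{t,j} \exp(-\eta\, \ell(\hat{y}_{t,j}, y_t)) / Z_t$; // WeightUpdate
    end for
    if the drift detector output exceeds $\delta$ or ($t \bmod p = 0$) then
        $h_{\mathrm{new}} \leftarrow$ ModelUpdate on the current epoch, reusing the models in $\mathcal{H}$ with their current weights;
        $\mathcal{H} \leftarrow \mathcal{H} \cup \{h_{\mathrm{new}}\}$;
        if $|\mathcal{H}| > K$ then
            Remove the oldest model from $\mathcal{H}$.
        end if
        for $j = 1$ to $|\mathcal{H}|$ do
            Initialize the weights: $\beta_{t+1,j} = 1/|\mathcal{H}|$;
        end for
    end if
end for
Algorithm 1 Condor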

3.2 Weight Update by Expert Advice

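After the ModelUpdate step, the weight distribution over the model pool is reinitialized. We adopt a uniform initialization: $\beta_j = 1/|\mathcal{H}|$ for $j = 1, \ldots, |\mathcal{H}|$, where $\mathcal{H}$ denotes the model pool.

After the initialization, we update the weight of each model by expert advice [Cesa-Bianchi and Lugosi, 2006]. Specifically, when a new data item arrives, we receive $\mathbf{x}_t$, each previous model $h_j$ provides its prediction $\hat{y}_{t,j}$, and the final prediction $\hat{y}_t$ is made based on the weighted combination of the expert advice (the $\hat{y}_{t,j}$'s). Next, the true label $y_t$ is revealed, and we update the weights according to the loss each model suffers, in an exponentially weighted manner,

$$\beta_{t+1,j} = \frac{\beta_{t,j} \exp\left(-\eta\, \ell(\hat{y}_{t,j}, y_t)\right)}{Z_t},$$

where $\eta$ is the step size and $Z_t$ is the normalization factor that makes the weights sum to one.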

The overall procedure of proposed approach Condor is summarized in Algorithm 1.
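A single round of prediction and weight update by expert advice can be sketched as follows. This is a minimal illustration, assuming per-model losses already scaled to $[0, 1]$ and a normalized exponential update; the function names `predict` and `update_weights` and the argument `eta` are hypothetical.

```python
import math


def predict(weights, expert_preds):
    """Final prediction: weighted combination of the experts' predictions."""
    return sum(w * p for w, p in zip(weights, expert_preds))


def update_weights(weights, losses, eta):
    """Exponentially weighted update followed by normalization."""
    unnorm = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    z = sum(unnorm)  # normalization factor
    return [u / z for u in unnorm]


# Two previous models start from uniform weights; the second one errs.
weights = [0.5, 0.5]
y_hat = predict(weights, [1.0, -1.0])
new_weights = update_weights(weights, losses=[0.0, 1.0], eta=1.0)
```

After the update, the model that suffered no loss carries the larger weight, so its prediction dominates the next weighted combination.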

4 Theoretical Analysis

In this section, we provide theoretical analysis both locally and globally.

  1. Local analysis: consider both generalization and regret aspects on each epoch locally;

  2. Global analysis: examine regret on the whole data stream globally.

Besides, as part of the local analysis, we also analyze multi-class model reuse, which we present in a separate subsection for clarity.

4.1 Local Analysis

The local analysis means that we scrutinize the performance on a particular epoch. First, we are concerned with the generalization ability of the model obtained by the ModelUpdate module. Second, we study the quality of the weights learned by the WeightUpdate module and the cumulative regret of prediction.

Consider the epoch $S_k$: the ModelUpdate module reuses the previous models to help build the new model $h_k$, as shown in Figure 1. To simplify the presentation, we introduce some notation. Suppose the data stream has length $T$ and is partitioned into consecutive epochs (the first epoch is excluded from the analysis, since it cannot utilize ModelUpdate). For epoch $S_k$, we assume the distribution is identical within the epoch, i.e., $S_k$ is a sample of $m_k$ points drawn i.i.d. according to a distribution $\mathcal{D}_k$, where $m_k$ denotes its length.

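First, we conduct a generalization analysis of the ModelUpdate module. Define the risk and empirical risk of a hypothesis (model) $h$ on epoch $S_k$ by

$$R(h) = \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}_k}\left[\ell(h(\mathbf{x}), y)\right], \qquad \hat{R}(h) = \frac{1}{m} \sum_{i \in S_k} \ell(h(\mathbf{x}_i), y_i),$$

where $\mathcal{D}_k$ denotes the distribution underlying epoch $S_k$. Here, with a slight abuse of notation, we also use $S_k$ to denote the indices included in the epoch, and write $m$ instead of $m_k$ for simplicity. The new model $h_k$ is built and updated on epoch $S_k$ via the ModelUpdate module, and we have the following generalization error bound.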

Theorem 1.

Assume that the non-negative loss function is bounded by , and is -Lipschitz continuous. Also, assume the regularizer is a non-negative and -strongly convex function w.r.t. a norm . Given the model pool with , and denote be a linear combination of previous models, i.e., and supposing and . Let be the model returned by . Then, for any

, with probability at least

, the following holds,222We use instead of for simplicity.

where and . Besides, and .

To better present the results, we only keep the leading term w.r.t. and , and we have

(4)

where , representing the risk of reusing model on current distribution.

To better present our main result, we defer the proof of Theorem 1 to Appendix B.

Remark 2.

Eq. (4) shows that the ModelUpdate procedure enjoys an $O(1/\sqrt{m})$ generalization bound under certain conditions. In particular, when the previous models are sufficiently good, that is, when the reused combination has a small risk on the current distribution, we can obtain an $O(1/m)$ bound, a fast-rate convergence guarantee. This implies the effectiveness of leveraging and reusing previous models on the current data, especially if we can reuse previous models properly (as we will illustrate in the following paragraph).

Remark 3.

The main techniques in the proof are inspired by Kuzborskij and Orabona [2017], but we differ in two aspects. First, we only assume the Lipschitz condition, and thus their results do not apply under our conditions. Second, we extend the analysis of model reuse to multi-class scenarios, and include the results in Appendix C for a better presentation.

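Before proceeding, we introduce more notation. Let $L_S = \sum_{t=1}^{T} \ell(\hat{y}_t, y_t)$ be the global cumulative loss on the whole data stream $S$. On epoch $S_k$, let $L_{S_k}$ be the local cumulative loss suffered by our approach, and $L^{(j)}_{S_k}$ the local cumulative loss suffered by the previous model $h_j$,

$$L_{S_k} = \sum_{t \in S_k} \ell(\hat{y}_t, y_t), \qquad L^{(j)}_{S_k} = \sum_{t \in S_k} \ell(\hat{y}_{t,j}, y_t), \tag{5}$$

where $\hat{y}_t$ is the final prediction and $\hat{y}_{t,j}$ is the prediction of $h_j$ at time $t$.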

Next, we show that WeightUpdate returns a good weight distribution, implying that our approach can reuse previous models properly. In fact, we have the following observation regarding the weight distribution.

Observation 1 (Weight Concentration).

During the WeightUpdate procedure in epoch $S_k$, the weights concentrate on those previous models that suffer a small cumulative loss on $S_k$.

Proof.

By a simple analysis of the WeightUpdate procedure, we know that the weight associated with the $j$-th previous model is equal to $\exp(-\eta L^{(j)}_{S_k}) / Z$, where $L^{(j)}_{S_k}$ denotes the cumulative loss of the $j$-th previous model on $S_k$ so far, $\eta$ is the step size, and $Z = \sum_{i} \exp(-\eta L^{(i)}_{S_k})$ is the normalization factor. ∎

Though the observation seems quite straightforward, it plays an important role in making our approach successful. The statement guarantees that the algorithm adaptively assigns more weight to better-fit previous models, which essentially depicts the ‘reusability’ of each model. We also conduct additional experiments to support this point in Appendix D.1.
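The weight concentration property can be checked numerically: starting from uniform weights, repeating a normalized exponential update gives the same result as a softmax over the negative cumulative losses. The sketch below assumes that form of update; the loss values and the names `run_updates` and `closed_form` are made up for illustration.

```python
import math


def run_updates(loss_rounds, eta):
    """Apply the normalized exponential update round by round."""
    n = len(loss_rounds[0])
    weights = [1.0 / n] * n  # uniform initialization
    for losses in loss_rounds:
        unnorm = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
        z = sum(unnorm)
        weights = [u / z for u in unnorm]
    return weights


def closed_form(loss_rounds, eta):
    """Weights proportional to exp(-eta * cumulative loss)."""
    totals = [sum(col) for col in zip(*loss_rounds)]
    unnorm = [math.exp(-eta * t) for t in totals]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Three previous models over three rounds; model 0 fits the epoch best.
loss_rounds = [[0.1, 0.9, 0.5], [0.2, 0.8, 0.5], [0.0, 1.0, 0.4]]
iterative = run_updates(loss_rounds, eta=2.0)
direct = closed_form(loss_rounds, eta=2.0)
```

Both routes give the largest weight to the model with the smallest cumulative loss, and a larger step size makes the concentration sharper.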

Third, we show that our approach can benefit from recurring concept drift scenarios. Here, we adopt the concept of cumulative regret (or regret) from online learning [Zinkevich, 2003] as the performance measurement.

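Theorem 2 (Improved Local Regret [Cesa-Bianchi and Lugosi, 2006]).

Assume that the loss function $\ell$ is convex in its first argument and takes values in $[0, 1]$. Besides, the step size is set as $\eta = \ln\left(1 + \sqrt{2 \ln N / L^*_{S_k}}\right)$, where $N$ is the number of previous models acting as experts and $L^*_{S_k} = \min_{j} L^{(j)}_{S_k}$ is the cumulative loss of the best-fit previous model on epoch $S_k$, which is supposed to be known in advance. Then, we have

$$L_{S_k} - L^*_{S_k} \le \sqrt{2 L^*_{S_k} \ln N} + \ln N,$$

where $L_{S_k}$ denotes the cumulative loss suffered by our approach on $S_k$.

Proof.

Refer to the proof presented on page 21, Section 2.4 of Cesa-Bianchi and Lugosi [2006]. ∎

The above statement shows that the regret bound can be substantially improved from the typical square-root dependence on the epoch length to a bound that is independent of the number of data items in the epoch, provided that the cumulative loss of the best-fit previous model is small.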
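To get a feel for the size of the improvement, the sketch below evaluates the small-loss bound of the exponentially weighted average forecaster from Cesa-Bianchi and Lugosi [2006], $\sqrt{2 L^* \ln N} + \ln N$ with step size $\ln(1 + \sqrt{2 \ln N / L^*})$, next to the standard worst-case bound $\sqrt{(n/2) \ln N}$. The symbols $N$ (number of experts), $n$ (number of rounds) and $L^*$ (best expert's cumulative loss), as well as the numeric values, are assumptions chosen for illustration.

```python
import math


def small_loss_step_size(num_experts, best_loss):
    """Step size tuned to the best expert's cumulative loss (must be > 0)."""
    return math.log(1.0 + math.sqrt(2.0 * math.log(num_experts) / best_loss))


def small_loss_regret_bound(num_experts, best_loss):
    """Bound that depends on the best loss, not on the number of rounds."""
    ln_n = math.log(num_experts)
    return math.sqrt(2.0 * best_loss * ln_n) + ln_n


def worst_case_regret_bound(num_experts, num_rounds):
    """Standard bound growing with the square root of the number of rounds."""
    return math.sqrt(0.5 * num_rounds * math.log(num_experts))


# Illustrative values: 25 experts, 10,000 rounds, best cumulative loss 2.
eta = small_loss_step_size(25, 2.0)
improved = small_loss_regret_bound(25, 2.0)
standard = worst_case_regret_bound(25, 10_000)
```

When a previously seen concept recurs, the best-fit previous model has a small cumulative loss and the first bound is far below the second, which is the regime the theorem targets.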

Remark 4.

Theorem 2 implies that if the concept of the current epoch or a similar concept has emerged previously, our approach enjoys a substantially improved local regret, provided that a proper step size is chosen. This accords with our intuition of why model reuse helps for data streams with concept drift. In many situations, although the underlying distribution might change over time, the concepts can be recurring, i.e., they disappear and re-appear [Katakis et al., 2010; Gama and Kosina, 2014]. Thus, the statement shows that our approach can benefit from such recurring concepts, and we empirically support this point in Appendix D.2.

4.2 Global Analysis

The global analysis means that we study the overall performance on the whole data stream. We provide the global dynamic regret as follows.

Theorem 3 (Global Dynamic Regret).

Assume that the loss function is convex in its first argument and takes the values in . Assume that the step size in epoch is set as ,333The choice of requires the knowledge of epoch size, which can be eliminated by doubling trick, at the price of a small constant factor [Cesa-Bianchi et al., 1997]. then we have

(6)

where .

The proof of the global dynamic regret is built on the local static regret analysis in each epoch. We can see that for a data stream of fixed length, the more concept drifts occur (that is, the more epochs there are), the larger the regret bound will be. This accords with our intuition: on the one hand, the sum of the best-fit local cumulative losses tends to be compressed as more previous models become available; on the other hand, the learning problem becomes harder as concept drift occurs more frequently.

Our weight update strategy is essentially the exponentially weighted average forecaster [Cesa-Bianchi and Lugosi, 2006], and thus we have the following local regret guarantee in each epoch.

Lemma 1 (Theorem 2.2 in Cesa-Bianchi and Lugosi [2006]).

Assume that the loss function $\ell$ is convex in its first argument and takes values in $[0, 1]$. Assume that the step size is set as $\eta = \sqrt{8 \ln N / m_k}$, where $N$ is the number of previous models acting as experts and $m_k$ is the number of instances in epoch $S_k$. Then we have

$$L_{S_k} - \min_{j} L^{(j)}_{S_k} \le \sqrt{\frac{m_k}{2} \ln N},$$

where $L_{S_k}$ and $L^{(j)}_{S_k}$ denote the cumulative losses on $S_k$ of our approach and of the $j$-th previous model, respectively.

Proof.

The proof is based on a simple reduction from our scenario to the standard exponentially weighted average forecaster. For epoch $S_k$, let the pool of previous models act as the expert pool. Then, plugging the number of experts and the number of instances into Theorem 2.2 in Cesa-Bianchi and Lugosi [2006], we obtain the statement.

Besides, the proof for the exponentially weighted average forecaster is standard and uses the potential function method [Cesa-Bianchi and Lugosi, 2006; Mohri et al., 2012]. For a detailed proof, one can refer to pages 157-159 in Chapter 7 of the book [Mohri et al., 2012]. ∎
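The guarantee of the exponentially weighted average forecaster can be observed in a small simulation. The sketch below is an illustration under assumptions: predictions and labels lie in $[0, 1]$, the loss is the absolute loss (convex in the prediction and bounded by one), the experts predict at random, and the step size is $\sqrt{8 \ln N / n}$ for $N$ experts and $n$ rounds. The function name `ewa_regret` is hypothetical.

```python
import math
import random


def ewa_regret(expert_preds, labels, eta):
    """Regret of the exponentially weighted average forecaster.

    expert_preds[t][j] and labels[t] lie in [0, 1]; absolute loss is used.
    """
    n_experts = len(expert_preds[0])
    weights = [1.0 / n_experts] * n_experts
    forecaster_loss = 0.0
    expert_loss = [0.0] * n_experts
    for preds, y in zip(expert_preds, labels):
        y_hat = sum(w * p for w, p in zip(weights, preds))
        forecaster_loss += abs(y_hat - y)
        losses = [abs(p - y) for p in preds]
        expert_loss = [a + b for a, b in zip(expert_loss, losses)]
        unnorm = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
        z = sum(unnorm)
        weights = [u / z for u in unnorm]
    return forecaster_loss - min(expert_loss)


rng = random.Random(0)
n_rounds, n_experts = 500, 5
labels = [rng.choice([0.0, 1.0]) for _ in range(n_rounds)]
expert_preds = [[rng.random() for _ in range(n_experts)]
                for _ in range(n_rounds)]
eta = math.sqrt(8.0 * math.log(n_experts) / n_rounds)
bound = math.sqrt(0.5 * n_rounds * math.log(n_experts))
regret = ewa_regret(expert_preds, labels, eta)
```

Because the bound holds for every loss sequence, the measured `regret` never exceeds `bound`, whatever seed is used.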

Now, we proceed to prove Theorem 3.

Proof.

Our proof relies on the application of the local static regret analysis. Since the epochs form a partition of the whole time horizon, we apply Lemma 1 on each epoch locally and obtain

(7)

Summing over all epochs, we have

(8)

where (8) holds by substituting (7) for each epoch, and (9) holds by applying the Cauchy-Schwarz inequality. ∎
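The aggregation step can be illustrated numerically. Assuming each epoch of length $m_k$ contributes a local bound of the form $\sqrt{(m_k/2) \ln N}$ with the same number of experts $N$ in every epoch (a simplification, so the constants may differ from those of Theorem 3), the Cauchy-Schwarz inequality gives $\sum_k \sqrt{m_k} \le \sqrt{E \sum_k m_k}$ for $E$ epochs. The function names below are hypothetical.

```python
import math


def summed_local_bounds(epoch_lengths, num_experts):
    """Sum of the per-epoch bounds sqrt(m_k / 2 * ln N)."""
    ln_n = math.log(num_experts)
    return sum(math.sqrt(0.5 * m * ln_n) for m in epoch_lengths)


def aggregated_bound(epoch_lengths, num_experts):
    """Cauchy-Schwarz upper bound: sqrt(E * T / 2 * ln N), with T = sum m_k."""
    total = sum(epoch_lengths)
    n_epochs = len(epoch_lengths)
    return math.sqrt(0.5 * n_epochs * total * math.log(num_experts))


uneven = [50, 120, 30, 200]
local_sum = summed_local_bounds(uneven, num_experts=10)
global_bound = aggregated_bound(uneven, num_experts=10)
```

The two quantities coincide when all epochs have the same length, and for a fixed stream length the aggregated bound grows with the number of epochs, matching the intuition that more frequent drift makes the problem harder.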

Remark 5.

Essentially, the regret bound in (6) is different from the traditional (static) regret bound. It measures the difference between the global cumulative loss and the sum of the local cumulative losses suffered by the best-fit previous models. Namely, our competitor changes in each epoch, which reflects the distribution change in the sequence, and is thus a more suitable performance measure in non-stationary environments.
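The difference between the two comparators can be seen on a toy example. The per-epoch loss table below is invented, and for simplicity it assumes the same three models are available in every epoch; the function names are hypothetical.

```python
def static_comparator(epoch_losses):
    """Cumulative loss of the single best model over the whole stream."""
    n_models = len(epoch_losses[0])
    return min(sum(epoch[j] for epoch in epoch_losses)
               for j in range(n_models))


def dynamic_comparator(epoch_losses):
    """Sum over epochs of the best model's loss within each epoch."""
    return sum(min(epoch) for epoch in epoch_losses)


# Rows are epochs, columns are models; the best model changes per epoch.
epoch_losses = [
    [1.0, 9.0, 5.0],
    [8.0, 2.0, 6.0],
    [1.5, 7.0, 4.0],
]
static_loss = static_comparator(epoch_losses)
dynamic_loss = dynamic_comparator(epoch_losses)
```

The dynamic comparator is never larger than the static one, so competing against it is the more demanding benchmark when the best model changes from epoch to epoch.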

4.3 Multi-Class Model Reuse Learning

In multi-class learning scenarios, the notations are slightly different from those in binary case. We first introduce the new notations for a clear presentation.

Let denote the input feature space and denote the target label space. Our analysis acts on the last data epoch , a sample of points drawn i.i.d. according to distribution , where and with only a single class from . Given the multi-class hypothesis set , any hypothesis maps from , and makes the prediction by . This naturally rises the definition of margin of the hypothesis at a labeled instance ,

The non-negative loss function is bounded by . Besides, we assume the loss function is regular loss defined in Lei et al. [2015].

Definition 1 (Regular Loss).

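We call a loss function $\ell$ $L$-regular if it satisfies the following properties (cf. Definition 2 in Lei et al. [2015]):

  1. $\ell(t)$ bounds the 0-1 loss from above: $\ell(t) \ge \mathbb{1}[t \le 0]$;

  2. $\ell$ is $L$-Lipschitz continuous, i.e., $|\ell(t_1) - \ell(t_2)| \le L\,|t_1 - t_2|$;

  3. $\ell(t)$ is decreasing and it has a zero point $c_\ell$, i.e., there exists a $c_\ell$ such that $\ell(c_\ell) = 0$.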
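As a concrete instance of the three properties above, the hinge loss $\max(0, 1 - t)$ on the margin $t$ can be spot-checked on a grid. Choosing the hinge loss is our own illustration rather than a choice made in the text, the Lipschitz constant 1 and zero point 1 are specific to it, and a grid check is of course not a proof. Note that the hinge loss is non-increasing rather than strictly decreasing.

```python
def hinge(t):
    """Hinge loss as a function of the margin t."""
    return max(0.0, 1.0 - t)


def zero_one(t):
    """0-1 loss on the margin: an error whenever the margin is non-positive."""
    return 1.0 if t <= 0 else 0.0


grid = [i / 100.0 for i in range(-300, 301)]
pairs = list(zip(grid, grid[1:]))

# Property 1: the loss upper-bounds the 0-1 loss.
upper_bounds_zero_one = all(hinge(t) >= zero_one(t) for t in grid)
# Property 2: 1-Lipschitz (small tolerance for floating-point rounding).
lipschitz_one = all(abs(hinge(a) - hinge(b)) <= abs(a - b) + 1e-12
                    for a, b in pairs)
# Property 3: non-increasing with a zero point at t = 1.
non_increasing = all(hinge(a) >= hinge(b) for a, b in pairs)
has_zero_point = hinge(1.0) == 0.0
```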

Then the risk and empirical risk of a hypothesis on epoch are defined by

Our goal is to provide a generalization analysis, namely, to prove that risk approaches empirical risk as number of instances increases, and establish the convergence rate. Since , thus, we cannot directly utilize concentration inequalities to help analysis. To make this simpler, we need to introduce the risk w.r.t. loss function ,

From property c in Definition 1, we know that the risk is a lower bound on , that is . Thus, we only need to establish generalization bound between and . Apparently, , thus we can utilize concentration inequalities again.

First, we identify the optimization formulation of multi-class biased regularization model reuse,

(10)

where .

We specify the regularizer as square of Frobenius norm, namely, , and provide the following generalization error bound.

Theorem 4.

Let be a hypothesis set with . Assume that the non-negative loss function is -regular. Given the model pool with , and denote be a linear combination of previous models, i.e., and supposing and . Let be the model returned by . Then, for any , with probability at least , the following holds,444We use instead of for simplicity.

where and . Besides, and .

To better present the results, we only keep the leading term w.r.t. and , and we have

(11)

where , representing the risk of reusing model on current distribution.

Remark 6.

From Theorem 4, we can see that the main result and conclusion in the multi-class case are very similar to those in the binary case. In (11), we can see that Condor enjoys an $O(1/\sqrt{m})$ generalization bound, which is consistent with common learning guarantees. More importantly, Condor enjoys an $O(1/m)$ fast-rate generalization guarantee when the risk of the reused model on the current distribution is small, namely, when the previous model is highly ‘reusable’ with respect to the current data. This shows the effectiveness of leveraging and reusing previous models to help build the new model in Condor in multi-class scenarios.

5 Experiments

In this section, we examine the effectiveness of Condor on both synthetic and real-world concept drift datasets. Additional experiments regarding weight concentration, recurring concept drift, parameter study and robustness comparisons are presented in Appendix D.

Compared Approaches. We conduct comparisons with two classes of state-of-the-art concept drift approaches. The first class is the ensemble category, including (a) Learn++.NSE [Elwell and Polikar, 2011], (b) DWM [Kolter and Maloof, 2003; Kolter and Maloof, 2007] and (c) AddExp [Kolter and Maloof, 2005]. The second class is the model-reuse category, including (d) DTEL [Sun et al., 2018] and (e) TIX [Forman, 2006]. DTEL and TIX essentially also adopt the ensemble idea; we classify them into the model-reuse category just to highlight their model reuse strategies.

Settings. In our experiments, we choose the ADWIN algorithm [Bifet and Gavaldà, 2007] as the drift detector, with the default parameter setting reported in their paper and source code. Besides, for all the approaches, we use the same maximum update period (epoch size) and the same model pool size. (Covertype and GasSensor use a different update period, since Covertype is extremely large with 581,012 data items in total and GasSensor is a multi-class dataset with a higher sample complexity.)

Dataset   # instance  # dim  # class    Dataset     # instance  # dim  # class
SEA200A       24,000      3        2    GEARS-2C-2D    200,000      2        2
SEA200G       24,000      3        2    Usenet-1         1,500    100        2
SEA500G       60,000      3        2    Usenet-2         1,500    100        2
CIR500G       60,000      3        2    Luxembourg       1,900     32        2
SIN500G       60,000      2        2    Spam             9,324    500        2
STA500G       60,000      3        2    Email            1,500    913        2
1CDT          16,000      2        2    Weather         18,159      8        2
1CHT          16,000      2        2    GasSensor        4,450    129        6
UG-2C-2D     100,000      2        2    Powersupply     29,928      2        2
UG-2C-3D     200,000      3        2    Electricity     45,312      8        2
UG-2C-5D     200,000      5        2    Covertype      581,012     54        2
Table 1: Basic statistics of datasets with concept drift.

Synthetic Datasets. Since it is not realistic to know in advance the detailed concept drift information of real-world datasets, such as the start and the end of a change, we employ six widely used synthetic datasets drawn from SEA, CIR, SIN and STA and their corresponding variants. Besides, another six synthetic datasets for binary classification are also adopted: 1CDT, 1CHT, UG-2C-2D, UG-2C-3D, UG-2C-5D, and GEARS-2C-2D. Brief statistics are summarized in Table 1. We provide the dataset information in Appendix E.

We plot the holdout accuracy comparisons over three synthetic datasets, SEA200A, SEA200G and SEA500G. Since some of the compared approaches are batch style, following the splitting setting in Sun et al. [2018], we split each dataset into 120 epochs for a clear presentation. The holdout accuracy is calculated over testing data generated from the same distribution as the training data at each time stamp. For SEA and its variants, the distribution changes seven times. From Figure 2, we can see that the accuracy of all approaches drops when an abrupt concept drift occurs. Nevertheless, our approach Condor is relatively stable and recovers rapidly as more data items arrive, with the highest accuracy among the compared approaches, which validates its effectiveness.

Figure 2: Holdout accuracy comparisons on three synthetic datasets: (a) SEA200A, (b) SEA200G, (c) SEA500G.
              Ensemble Category                            Model-Reuse Category          Ours
Dataset       Learn++.NSE    DWM            AddExp         DTEL           TIX            Condor
SEA200A       84.48 ± 0.19   86.07 ± 0.30   84.35 ± 0.86   80.50 ± 0.58   82.79 ± 0.27   86.67 ± 0.21
SEA200G       85.48 ± 0.33   86.92 ± 0.13   85.54 ± 0.69   80.73 ± 0.19   82.95 ± 0.12   87.63 ± 0.24
SEA500G       86.03 ± 0.19   87.63 ± 0.06   87.14 ± 0.12   80.42 ± 0.24   83.26 ± 0.07   88.21 ± 0.04
CIR500G       84.77 ± 0.56   77.09 ± 0.71   76.48 ± 0.81   79.03 ± 0.34   66.38 ± 0.85   68.41 ± 0.87
SIN500G       79.41 ± 0.07   66.99 ± 0.10   66.81 ± 0.12   74.93 ± 0.34   62.73 ± 0.14   65.68 ± 0.12
STA500G       83.97 ± 0.13   87.43 ± 0.18   86.89 ± 0.27   88.26 ± 0.18   85.95 ± 0.07   88.60 ± 0.07
1CDT          99.77 ± 0.14   99.92 ± 0.10   99.92 ± 0.10   99.69 ± 0.11   99.56 ± 0.08   99.95 ± 0.04
1CHT          99.69 ± 0.20   99.71 ± 0.28   99.56 ± 0.46   92.08 ± 0.22   99.41 ± 0.22   99.86 ± 0.13
UG-2C-2D      94.42 ± 0.12   95.60 ± 0.12   94.36 ± 0.78   93.98 ± 0.13   94.69 ± 0.13   95.27 ± 0.09
UG-2C-3D      93.82 ± 0.60   95.23 ± 0.59   94.61 ± 0.73   92.94 ± 0.72   94.31 ± 0.69   94.84 ± 0.59
UG-2C-5D      90.30 ± 0.30   92.85 ± 0.23   92.20 ± 0.23   88.21 ± 0.35   89.84 ± 0.38   91.83 ± 0.24
GEARS-2C-2D   95.82 ± 0.02   95.83 ± 0.02   95.83 ± 0.02   94.96 ± 0.03   95.03 ± 0.02   95.91 ± 0.01
Usenet-1      63.76 ± 2.01   67.26 ± 3.11   62.11 ± 2.67   68.02 ± 1.19   65.03 ± 1.70   73.13 ± 1.12
Usenet-2      72.42 ± 1.14   68.41 ± 1.17   70.55 ± 2.41   72.02 ± 0.87   70.56 ± 1.15   75.13 ± 1.06
Luxembourg    98.64 ± 0.00   90.42 ± 0.55   90.77 ± 0.52   100.0 ± 0.00   90.99 ± 0.97   99.98 ± 0.03
Spam          90.79 ± 0.85   92.18 ± 0.34   91.78 ± 0.33   85.53 ± 1.22   87.10 ± 1.45   95.22 ± 0.48
Email         74.21 ± 4.61   72.58 ± 4.10   60.78 ± 6.12   83.36 ± 1.87   79.83 ± 3.73   91.60 ± 1.86
Weather       75.99 ± 0.36   70.83 ± 0.49   70.07 ± 0.34   68.92 ± 0.27   70.21 ± 0.33   79.37 ± 0.26
GasSensor     42.36 ± 3.72   76.61 ± 0.36   76.61 ± 0.36   63.82 ± 3.64   43.40 ± 2.88   81.57 ± 3.77
Powersupply   74.06 ± 0.28   72.09 ± 0.29   72.13 ± 0.23   69.90 ± 0.38   68.34 ± 0.16   72.82 ± 0.29
Electricity   78.97 ± 0.18   78.03 ± 0.17   75.62 ± 0.42   81.05 ± 0.35   58.44 ± 0.71   84.73 ± 0.33
Covertype     79.08 ± 1.30   74.17 ± 0.87   73.13 ± 1.53   69.43 ± 1.30   64.60 ± 0.89   89.58 ± 0.14
Condor w/t/l  18/1/3         14/3/4         17/2/3         19/1/2         22/0/0         rank first 16/22
Table 2: Performance comparisons (mean ± standard deviation of predictive accuracy) on synthetic and real-world datasets. The last row reports the win/tie/loss counts of Condor against each compared approach, where a win (loss) means Condor is significantly better (worse) according to paired t-tests at 95% significance level.

Real-world Datasets. We adopt 10 real-world datasets: Usenet-1, Usenet-2, Luxembourg, Spam, Email, Weather, GasSensor, Powersupply, Electricity and Covertype. The number of data items varies from 1,500 to 581,012, and the class number varies from 2 to 6. Detailed descriptions are provided in Appendix E.2. We conduct all the experiments for 10 trials and report the overall mean and standard deviation of predictive accuracy in Table 2, where the synthetic datasets are also included. As we can see, Condor has a significant advantage over the other approaches. It achieves the best result on 16 out of 22 datasets in total, and ranks second on four other datasets. The reason Condor performs poorly on CIR500G and SIN500G is that these two datasets are highly nonlinear (generated by a circle and a sine function, respectively). It is also noteworthy that Condor performs significantly better than the other approaches on real-world datasets. These results show the superiority of our proposed approach.

6 Conclusion

In this paper, a novel and effective approach, Condor, is proposed for handling concept drift via model reuse. It consists of two key components, ModelUpdate and WeightUpdate. Our approach is built on a drift detector: when a drift is detected or a maximum accumulation number is reached, ModelUpdate leverages and reuses previous models in a weighted manner. Meanwhile, WeightUpdate adaptively adjusts the weights of previous models according to their performance. Through the generalization analysis, we prove that the model reuse strategy helps if we properly reuse previous models. Through the regret analysis, we show that the weights concentrate on the better-fit models, and that the approach achieves a fair dynamic cumulative regret. Empirical results show the superiority of our approach over the compared approaches on both synthetic and real-world datasets.

In the future, it would be interesting to incorporate more techniques from model reuse learning into handling concept drift problems.

References

  • Bartlett and Mendelson [2002] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
  • Beygelzimer et al. [2015] Alina Beygelzimer, Satyen Kale, and Haipeng Luo. Optimal and adaptive algorithms for online boosting. In International Conference on Machine Learning, ICML, pages 2323–2331, 2015.
  • Bifet and Gavaldà [2007] Albert Bifet and Ricard Gavaldà. Learning from time-changing data with adaptive windowing. In Proceedings of the Seventh SIAM International Conference on Data Mining, pages 443–448, 2007.
  • Bousquet et al. [2003] Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. In Advanced Lectures on Machine Learning, ML Summer Schools 2003, pages 169–207, 2003.
  • Bousquet [2002] Olivier Bousquet. Concentration inequalities and empirical processes theory applied to the analysis of learning algorithms. PhD thesis, Ecole Polytechnique, 2002.
  • Cattral et al. [2002] Robert Cattral, Franz Oppacher, and Dwight Deugo. Evolutionary data mining with automatic rule generalization. 2002.
  • Cesa-Bianchi and Lugosi [2006] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
  • Cesa-Bianchi et al. [1997] Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, 1997.
  • Chen et al. [2015] Yanping Chen, Eamonn Keogh, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, and Gustavo Batista. The ucr time series classification archive. 2015.
  • Crammer et al. [2010] Koby Crammer, Yishay Mansour, Eyal Even-Dar, and Jennifer Wortman Vaughan. Regret minimization with concept drift. In Annual Conference Computational Learning Theory, COLT, pages 168–180, 2010.
  • de Souza et al. [2015] Vinícius M. A. de Souza, Diego Furtado Silva, João Gama, and Gustavo E. A. P. A. Batista. Data stream classification guided by clustering on nonstationary environments and extreme verification latency. In SIAM International Conference on Data Mining, SDM, pages 873–881, 2015.
  • Du et al. [2017] Simon S. Du, Jayanth Koushik, Aarti Singh, and Barnabás Póczos. Hypothesis transfer learning via transformation functions. In Advances in Neural Information Processing Systems, NIPS, pages 574–584, 2017.
  • Duan et al. [2009] Lixin Duan, Ivor W. Tsang, Dong Xu, and Tat-Seng Chua. Domain adaptation from multiple sources via auxiliary classifiers. In International Conference on Machine Learning, ICML, pages 289–296, 2009.
  • Elwell and Polikar [2011] Ryan Elwell and Robi Polikar. Incremental learning of concept drift in nonstationary environments. IEEE Transactions on Neural Networks, 22(10):1517–1531, 2011.
  • Forman [2006] George Forman. Tackling concept drift by temporal inductive transfer. In International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR, pages 252–259, 2006.
  • Gama and Kosina [2014] João Gama and Petr Kosina. Recurrent concepts in data streams classification. Knowledge and Information Systems, 40(3):489–507, 2014.
  • Gama et al. [2003] João Gama, Ricardo Rocha, and Pedro Medas. Accurate decision trees for mining high-speed data streams. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD, pages 523–528, 2003.
  • Gama et al. [2014] João Gama, Indre Zliobaite, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys, 46(4):44:1–44:37, 2014.
  • Gomes et al. [2017] Heitor Murilo Gomes, Jean Paul Barddal, Fabrício Enembreck, and Albert Bifet. A survey on ensemble learning for data stream classification. ACM Computing Surveys, 50(2):23:1–23:36, 2017.
  • Harel et al. [2014] Maayan Harel, Shie Mannor, Ran El-Yaniv, and Koby Crammer. Concept drift detection through resampling. In International Conference on Machine Learning, ICML, pages 1009–1017, 2014.
  • Harries and Wales [1999] Michael Harries and New South Wales. Splice-2 comparative evaluation: Electricity pricing. Technical Report of South Wales University, 1999.
  • Helmbold and Long [1994] David P. Helmbold and Philip M. Long. Tracking drifting concepts by minimizing disagreements. Machine Learning, 14(1):27–45, 1994.
  • Jaber et al. [2013] Ghazal Jaber, Antoine Cornuéjols, and Philippe Tarroux. A new on-line learning method for coping with recurring concepts: The ADACC system. In International Conference on Neural Information Processing, ICONIP, pages 595–604, 2013.
  • Kakade et al. [2012] Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13:1865–1890, 2012.
  • Katakis et al. [2008] Ioannis Katakis, Grigorios Tsoumakas, and Ioannis P. Vlahavas. An ensemble of classifiers for coping with recurring contexts in data streams. In European Conference on Artificial Intelligence, ECAI, pages 763–764, 2008.
  • Katakis et al. [2009] Ioannis Katakis, Grigorios Tsoumakas, Evangelos Banos, Nick Bassiliades, and Ioannis P. Vlahavas. An adaptive personalized news dissemination system. Journal of Intelligent Information Systems, 32(2):191–212, 2009.
  • Katakis et al. [2010] Ioannis Katakis, Grigorios Tsoumakas, and Ioannis P. Vlahavas. Tracking recurring contexts using ensemble classifiers: an application to email filtering. Knowledge and Information Systems, 22(3):371–391, 2010.
  • Klinkenberg and Joachims [2000] Ralf Klinkenberg and Thorsten Joachims. Detecting concept drift with support vector machines. In International Conference on Machine Learning, ICML, pages 487–494, 2000.
  • Klinkenberg [2004] Ralf Klinkenberg. Learning drifting concepts: Example selection vs. example weighting. Intelligent Data Analysis, 8(3):281–300, 2004.
  • Kolter and Maloof [2003] Jeremy Z. Kolter and Marcus A. Maloof. Dynamic weighted majority: A new ensemble method for tracking concept drift. In IEEE International Conference on Data Mining, ICDM, pages 123–130, 2003.
  • Kolter and Maloof [2005] Jeremy Z. Kolter and Marcus A. Maloof. Using additive expert ensembles to cope with concept drift. In International Conference on Machine Learning, ICML, pages 449–456, 2005.
  • Kolter and Maloof [2007] J. Zico Kolter and Marcus A. Maloof. Dynamic weighted majority: An ensemble method for drifting concepts. Journal of Machine Learning Research, 8:2755–2790, 2007.
  • Koychev [2000] Ivan Koychev. Gradual forgetting for adaptation to concept drift. In Proceedings of ECAI 2000 Workshop Current Issues in Spatio-Temporal Reasoning, pages 101–106, 2000.
  • Kuncheva and Zliobaite [2009] Ludmila I. Kuncheva and Indre Zliobaite. On the window size for classification in changing environments. Intelligent Data Analysis, 13(6):861–872, 2009.
  • Kuzborskij and Orabona [2013] Ilja Kuzborskij and Francesco Orabona. Stability and hypothesis transfer learning. In International Conference on Machine Learning, ICML, pages 942–950, 2013.
  • Kuzborskij and Orabona [2017] Ilja Kuzborskij and Francesco Orabona. Fast rates by transferring from auxiliary hypotheses. Machine Learning, 106(2):171–195, 2017.
  • Ledoux and Talagrand [2013] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013.
  • Lei et al. [2015] Yunwen Lei, Ürün Dogan, Alexander Binder, and Marius Kloft. Multi-class svms: From tighter data-dependent generalization bounds to novel algorithms. In Advances in Neural Information Processing Systems, NIPS, pages 2035–2043, 2015.
  • Li et al. [2013] Nan Li, Ivor W. Tsang, and Zhi-Hua Zhou. Efficient optimization of performance measures by classifier adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(6):1370–1382, 2013.
  • Maurer [2016] Andreas Maurer. A vector-contraction inequality for rademacher complexities. In International Conference on Algorithmic Learning Theory, ALT, pages 3–17, 2016.
  • Mohri and Medina [2012] Mehryar Mohri and Andres Muñoz Medina. New analysis and algorithm for learning with drifting distributions. In ALT, pages 124–138, 2012.
  • Mohri et al. [2012] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2012.
  • Schapire and Freund [2012] Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012.
  • Schapire [1990] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5:197–227, 1990.
  • Schlimmer and Granger [1986] Jeffrey C. Schlimmer and Richard H. Granger. Incremental learning from noisy data. Machine Learning, 1(3):317–354, 1986.
  • Schölkopf et al. [2001] Bernhard Schölkopf, Ralf Herbrich, and Alexander J. Smola. A generalized representer theorem. In Annual Conference Computational Learning Theory, COLT, pages 416–426, 2001.
  • Segev et al. [2017] Noam Segev, Maayan Harel, Shie Mannor, Koby Crammer, and Ran El-Yaniv. Learn on source, refine on target: A model transfer learning framework with random forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1811–1824, 2017.
  • Srebro et al. [2010] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, NIPS, pages 2199–2207, 2010.
  • Street and Kim [2001] W. Nick Street and YongSeog Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD, pages 377–382, 2001.
  • Sun et al. [2018] Y. Sun, K. Tang, Z. Zhu, and X. Yao. Concept drift adaptation by exploiting historical knowledge. IEEE Transactions on Neural Networks and Learning Systems, To appear, 2018.
  • Suykens et al. [2002] Johan AK Suykens, Tony Van Gestel, and Jos De Brabanter. Least squares support vector machines. World Scientific, 2002.
  • Tommasi et al. [2010] Tatiana Tommasi, Francesco Orabona, and Barbara Caputo. Safety in numbers: Learning categories from few examples with multi model knowledge transfer. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pages 3081–3088, 2010.
  • Tommasi et al. [2014] Tatiana Tommasi, Francesco Orabona, and Barbara Caputo. Learning categories from few examples with multi model knowledge transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5):928–941, 2014.
  • Vergara et al. [2012] Alexander Vergara, Shankar Vembu, Tuba Ayhan, Margaret A Ryan, Margie L Homer, and Ramón Huerta. Chemical gas sensor drift compensation using classifier ensembles. Sensors and Actuators B: Chemical, 166:320–329, 2012.
  • Vlachos et al. [2002] Michail Vlachos, Carlotta Domeniconi, Dimitrios Gunopulos, George Kollios, and Nick Koudas. Non-linear dimensionality reduction techniques for classification and visualization. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD, pages 645–651, 2002.
  • Zhou [2012] Zhi-Hua Zhou. Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC Press, 2012.
  • Zinkevich [2003] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, ICML, pages 928–936, 2003.
  • Zliobaite [2011] Indre Zliobaite. Combining similarity in time and space for training set formation under concept drift. Intelligent Data Analysis, 15(4):589–611, 2011.

Appendix A Prerequisite Knowledge and Technical Lemmas

In this section, we introduce prerequisite knowledge and technical lemmas for proving the main results. Specifically, we will utilize Rademacher complexity [Bartlett and Mendelson, 2002] in proving generalization bounds. Besides, we will also exploit the function properties when bounding Rademacher complexity and proving the regret bounds.

A.1 Rademacher Complexity

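To simplify the presentation, first, we introduce some notations. Let $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ be a sample of $m$ points drawn i.i.d. according to distribution $\mathcal{D}$, then the risk and empirical risk of hypothesis $h$ are defined by

$$R(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\ell(h(x), y)\right], \qquad \hat{R}_S(h) = \frac{1}{m} \sum_{i=1}^{m} \ell(h(x_i), y_i).$$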

In the following, we will utilize the notion of Rademacher complexity [Bartlett and Mendelson, 2002] to measure the hypothesis complexity and use it to bound the generalization error.

Definition 2.

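(Rademacher Complexity [Bartlett and Mendelson, 2002]) Let $\mathcal{G}$ be a family of functions and a fixed sample of size $m$ as $S = (z_1, \ldots, z_m)$. Then, the empirical Rademacher complexity of $\mathcal{G}$ with respect to the sample $S$ is defined as:

$$\hat{\mathfrak{R}}_S(\mathcal{G}) = \mathbb{E}_{\sigma}\left[\sup_{g \in \mathcal{G}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i g(z_i)\right],$$

where $\sigma_1, \ldots, \sigma_m$ are independent random variables uniformly distributed over $\{-1, +1\}$. Besides, the Rademacher complexity of $\mathcal{G}$ is the expectation of the empirical Rademacher complexity over all samples of size $m$ drawn according to $\mathcal{D}$:

$$\mathfrak{R}_m(\mathcal{G}) = \mathbb{E}_{S \sim \mathcal{D}^m}\left[\hat{\mathfrak{R}}_S(\mathcal{G})\right]. \tag{12}$$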
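To make the definition concrete, the following sketch computes the empirical Rademacher complexity exactly for a finite function class on a small fixed sample, by enumerating all sign vectors. The function name `empirical_rademacher` and the toy classes are assumptions introduced for illustration; the enumeration is only feasible for small sample sizes.

```python
import itertools


def empirical_rademacher(function_values):
    """Exact empirical Rademacher complexity of a finite function class.

    function_values: list of tuples; each tuple stores one function's values
    (g(z_1), ..., g(z_m)) on the fixed sample.
    """
    m = len(function_values[0])
    total = 0.0
    # Average the supremum over all 2^m equally likely sign vectors.
    for sigma in itertools.product((-1, 1), repeat=m):
        total += max(
            sum(s * v for s, v in zip(sigma, values)) / m
            for values in function_values
        )
    return total / 2 ** m


# Hypothetical toy classes evaluated on a sample of m = 3 points.
single = [(1.0, -1.0, 1.0)]  # a single function cannot adapt to the signs
all_signs = list(itertools.product((-1.0, 1.0), repeat=3))  # realizes every labelling

complexity_single = empirical_rademacher(single)
complexity_all = empirical_rademacher(all_signs)
print(complexity_single, complexity_all)
```

The two classes sit at the extremes: a single function cannot correlate with random signs on average, whereas a class that realizes every sign pattern correlates perfectly with each of them.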

A.2 Function Properties

In this paragraph, we introduce several common and useful functional properties.

Definition 3 (Lipschitz Continuity).

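A function $f: \mathcal{K} \to \mathbb{R}$ is $L$-Lipschitz continuous w.r.t. a norm $\|\cdot\|$ over domain $\mathcal{K}$ if for all $x, y \in \mathcal{K}$, we have

$$|f(x) - f(y)| \leq L \|x - y\|.$$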

Definition 4 (Strong Convexity).

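A function $f$ is $\lambda$-strongly convex w.r.t. a norm $\|\cdot\|$ if for all $x, y$ and for any $\alpha \in [0, 1]$, we have

$$f((1-\alpha)x + \alpha y) \leq (1-\alpha) f(x) + \alpha f(y) - \frac{\lambda}{2}\alpha(1-\alpha)\|x - y\|^2.$$

A common and equivalent form for the differentiable case is

$$f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\lambda}{2}\|y - x\|^2. \tag{13}$$

Definition 5 (Smoothness).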

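A function $f$ is $H$-smooth w.r.t. a norm $\|\cdot\|$ if for all $x, y$, we have

$$f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{H}{2}\|y - x\|^2.$$

If $f$ is differentiable, the above condition is equivalent to a Lipschitz condition over the gradients,

$$\|\nabla f(x) - \nabla f(y)\|_\star \leq H \|x - y\|,$$

where $\|\cdot\|_\star$ denotes the dual norm.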
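As a quick sanity check of the three properties, the sketch below verifies them numerically for the one-dimensional function f(x) = x^2 on the interval [-1, 1], where the norm is the absolute value. The constants and variable names (`lipschitz_const`, `strong_convexity`, `smoothness`) are assumptions chosen for this toy example: the function is 2-Lipschitz on this interval, 2-strongly convex and 2-smooth.

```python
def f(x):
    return x * x


def grad_f(x):
    return 2.0 * x


# Assumed constants for f(x) = x^2 restricted to [-1, 1].
lipschitz_const = 2.0
strong_convexity = 2.0
smoothness = 2.0

grid = [i / 10.0 for i in range(-10, 11)]
tol = 1e-12  # slack for floating-point rounding

# Lipschitz continuity: |f(x) - f(y)| <= L |x - y|
lipschitz_ok = all(
    abs(f(x) - f(y)) <= lipschitz_const * abs(x - y) + tol
    for x in grid for y in grid
)

# Strong convexity, first-order form:
# f(y) >= f(x) + f'(x) (y - x) + (lambda / 2) (y - x)^2
strong_convex_ok = all(
    f(y) + tol >= f(x) + grad_f(x) * (y - x) + 0.5 * strong_convexity * (y - x) ** 2
    for x in grid for y in grid
)

# Smoothness as a Lipschitz condition on the gradient:
# |f'(x) - f'(y)| <= H |x - y|
smooth_ok = all(
    abs(grad_f(x) - grad_f(y)) <= smoothness * abs(x - y) + tol
    for x in grid for y in grid
)

print(lipschitz_ok, strong_convex_ok, smooth_ok)
```

For this quadratic, the strong convexity and smoothness inequalities hold with equality, which is why a small tolerance is used in the comparisons.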

A.3 Technical Lemmas

To obtain a fast generalization rate, we essentially need a Bernstein-type concentration inequality. We adopt the functional generalization of Bennett’s inequality due to Bousquet [2002]; for self-containedness, we state the conclusion in Lemma 2 as follows.

Lemma 2 (Theorem 2.11 in Bousquet [2002]).

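Assume the $X_i$ are identically distributed according to $P$. Let $\mathcal{F}$ be a countable set of functions from $\mathcal{X}$ to $\mathbb{R}$ and assume that all functions $f$ in $\mathcal{F}$ are $P$-measurable, square-integrable and satisfy $\mathbb{E}[f] = 0$. If $\sup_{f \in \mathcal{F}} \operatorname{ess\,sup} f \leq 1$ then we denote

$$Z = \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} f(X_i),$$

and if $\sup_{f \in \mathcal{F}} \|f\|_\infty \leq 1$, $Z$ can be defined as above or as

$$Z = \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^{n} f(X_i) \right|.$$

Let $\sigma$ be a positive real number such that $\sigma^2 \geq \sup_{f \in \mathcal{F}} \operatorname{Var}[f(X_1)]$ almost surely, then for all $x \geq 0$, we have

$$\Pr\left[Z \geq \mathbb{E}[Z] + x\right] \leq \exp\left(-v\, h\left(\frac{x}{v}\right)\right),$$

with $v = n\sigma^2 + 2\mathbb{E}[Z]$ and $h(x) = (1+x)\log(1+x) - x$, also

$$\Pr\left[Z \geq \mathbb{E}[Z] + \sqrt{2xv} + \frac{x}{3}\right] \leq e^{-x}.$$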

Besides, for a strongly convex regularizer, we have the following property, which will be useful in proving Theorem 1.

Lemma 3 (Corollary 4 in Kakade et al. [2012]).

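If $\Omega$ is $\lambda$-strongly convex w.r.t. a norm $\|\cdot\|$ and $\Omega^\star(\mathbf{0}) = 0$, then, denoting the partial sum $\sum_{j \leq i} v_j$ by $v_{1:i}$, we have for any sequence $v_1, \ldots, v_n$ and for any $u$,

$$\sum_{i=1}^{n} \langle v_i, u \rangle - \Omega(u) \leq \Omega^\star(v_{1:n}) \leq \sum_{i=1}^{n} \langle \nabla \Omega^\star(v_{1:i-1}), v_i \rangle + \frac{1}{2\lambda} \sum_{i=1}^{n} \|v_i\|_\star^2,$$

where $\Omega^\star$ denotes the Fenchel conjugate of $\Omega$ and $\|\cdot\|_\star$ the dual norm.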
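A small numerical illustration may help in reading this lemma. The sketch below assumes the inequality takes the standard form from Kakade et al. [2012], namely that the sum of inner products of the v_i with u minus the regularizer at u is at most the conjugate evaluated at the full partial sum, which in turn is at most the sum of the linearized terms plus the sum of squared dual norms divided by twice the strong convexity parameter. It instantiates the regularizer as half the squared Euclidean norm, which is 1-strongly convex w.r.t. the Euclidean norm and equals its own conjugate. All names and the random vectors are hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def omega(u):
    """Regularizer 0.5 * ||u||_2^2, 1-strongly convex w.r.t. the Euclidean norm."""
    return 0.5 * dot(u, u)


def omega_star(v):
    """Fenchel conjugate of omega; equals 0.5 * ||v||_2^2 and vanishes at zero."""
    return 0.5 * dot(v, v)


def grad_omega_star(v):
    return list(v)


rng = random.Random(0)
dim, n, strong_convexity = 3, 5, 1.0
vs = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
u = [rng.uniform(-1, 1) for _ in range(dim)]

partial = [0.0] * dim  # running partial sum v_{1:i-1}
upper = 0.0
for v in vs:
    upper += dot(grad_omega_star(partial), v) + dot(v, v) / (2 * strong_convexity)
    partial = [p + x for p, x in zip(partial, v)]

lower = sum(dot(v, u) for v in vs) - omega(u)
middle = omega_star(partial)  # conjugate evaluated at v_{1:n}
print(lower, middle, upper)
```

For this particular quadratic regularizer the second inequality is tight, so `middle` and `upper` coincide up to rounding, while the first inequality is an instance of the Fenchel-Young inequality.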

Lemma 4 (Lemma 8.1 in Mohri et al. [2012]).

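Let $\mathcal{H}_1, \ldots, \mathcal{H}_l$ be $l$ hypothesis sets in $\mathbb{R}^{\mathcal{X}}$, $l \geq 1$, and let $\mathcal{G} = \{\max\{h_1, \ldots, h_l\} : h_i \in \mathcal{H}_i, i \in [1, l]\}$. Then, for any sample $S$ of size $m$, the empirical Rademacher complexity of $\mathcal{G}$ can be upper bounded as follows:

$$\hat{\mathfrak{R}}_S(\mathcal{G}) \leq \sum_{j=1}^{l} \hat{\mathfrak{R}}_S(\mathcal{H}_j).$$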
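The bound in this lemma, namely that the empirical Rademacher complexity of the class of pointwise maxima is at most the sum of the complexities of the individual hypothesis sets, can be checked exactly on small finite classes. The sketch below does so for two hypothetical hypothesis sets on a sample of four points; the function values are arbitrary illustrative numbers and the helper `empirical_rademacher` is an assumption of this example, not part of the analysis.

```python
import itertools


def empirical_rademacher(function_values):
    """Exact empirical Rademacher complexity of a finite class on a fixed sample."""
    m = len(function_values[0])
    total = 0.0
    # Enumerate all 2^m sign vectors and average the supremum.
    for sigma in itertools.product((-1, 1), repeat=m):
        total += max(
            sum(s * v for s, v in zip(sigma, values)) / m
            for values in function_values
        )
    return total / 2 ** m


# Hypothetical hypothesis sets, each function given by its values on 4 sample points.
h1 = [(0.2, -0.5, 0.9, 0.1), (-0.3, 0.4, 0.0, 0.7)]
h2 = [(0.6, 0.1, -0.8, -0.2), (-0.1, -0.9, 0.3, 0.5), (0.0, 0.2, 0.2, -0.4)]

# Class of pointwise maxima over all pairs (one function from each set).
g = [tuple(max(a, b) for a, b in zip(f1, f2)) for f1 in h1 for f2 in h2]

lhs = empirical_rademacher(g)
rhs = empirical_rademacher(h1) + empirical_rademacher(h2)
print(lhs, rhs)
```

Because the enumeration is exact, any violation of the inequality here would indicate a bug in the helper rather than sampling noise.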