With the rapid development of data collection technology, it is of great importance to analyze and extract knowledge from the collected data. However, data commonly arrive in a streaming form and are usually collected from non-stationary environments, and thus they are evolving in nature. In other words, the joint distribution between the input feature and the target label will change, which is also referred to as concept drift in the literature [Gama et al., 2014]. If we simply ignore the distribution change when learning from an evolving data stream, the performance will drop dramatically, which is undesirable both empirically and theoretically. The concept drift problem has become one of the most challenging issues in data stream learning, and it has gradually drawn researchers' attention to the design of effective and theoretically sound algorithms.
A data stream with concept drift is essentially impossible to learn (predict) from without any assumption on the distribution change. That is, if the underlying distribution changes arbitrarily or even adversarially, there is no hope of learning a good model to make predictions. We share the same assumption as most previous work, namely, that previous data contain some knowledge useful for future prediction. Whether sliding-window-based approaches [Klinkenberg and Joachims, 2000; Bifet and Gavaldà, 2007; Kuncheva and Zliobaite, 2009], forgetting-based approaches [Koychev, 2000; Klinkenberg, 2004] or ensemble-based approaches [Kolter and Maloof, 2005; Kolter and Maloof, 2007; Sun et al., 2018], all share this assumption; the only difference lies in how previous knowledge or data are utilized.
Another issue is that most previous work on handling concept drift focuses on algorithm design, and only a few works consider the theoretical side [Helmbold and Long, 1994; Crammer et al., 2010; Mohri and Medina, 2012]. Some works propose algorithms along with theoretical analysis; for example, Kolter and Maloof provide mistake and loss bounds guaranteeing that the performance of the proposed approach is relative to the performance of the base learner, and Harel et al. detect concept drift via resampling and provide bounds based on stability analysis. However, few approaches have clear theoretical guarantees, or justifications of why and how to leverage previous knowledge to combat concept drift, especially from the generalization aspect.
In this paper, we propose a novel and effective approach for handling Concept drift via model reuse, or Condor. It consists of two modules: the model update module leverages previous knowledge to help build the new model and update the model pool, while the weight update module adaptively assigns weights to previous models according to their performance, representing their reusability on the current data. We justify the advantage of model update from the aspect of generalization analysis, showing that our approach can benefit from a good weighted combination of previous models. Meanwhile, the weight update module guarantees that the weights concentrate on the better-fit models. Besides, we also provide a dynamic regret analysis. Empirical experiments on both synthetic and real-world datasets validate the effectiveness of our approach.
2 Related Work
Concept drift has been well recognized in recent research [Gama et al., 2014; Gomes et al., 2017]. Basically, if there is no structural information about the data stream, and the distribution can change arbitrarily or even adversarially, we should not expect to learn from historical data and make any meaningful prediction. Thus, it is crucial to make assumptions about the concept drift stream. Typically, most previous work assumes that nearby data items contain more useful information w.r.t. the current data, and researchers have therefore proposed plenty of approaches based on sliding-window and forgetting mechanisms. Sliding-window-based approaches maintain the nearest data items and discard old items, with a fixed or adaptive window size [Klinkenberg and Joachims, 2000; Kuncheva and Zliobaite, 2009]. Forgetting-based approaches do not explicitly discard old items but downweight previous data items according to their age [Koychev, 2000; Klinkenberg, 2004]. Another important category is the ensemble-based approaches, which can adaptively add or delete base classifiers and dynamically adjust weights when dealing with an evolving data stream. A series of works borrows the idea from boosting [Schapire, 1990] and online boosting [Beygelzimer et al., 2015], dynamically adjusting the weights of classifiers. To take a few representatives: dynamic weighted majority (DWM) dynamically creates and removes weighted experts in response to changes [Kolter and Maloof, 2003; Kolter and Maloof, 2007]; the additive expert ensemble (AddExp) maintains and dynamically adjusts an additive expert pool, and provides theoretical guarantees with solid mistake and loss bounds [Kolter and Maloof, 2005]; learning in non-stationary environments (Learn++.NSE) trains one new classifier for each batch of data it receives, and combines these classifiers [Elwell and Polikar, 2011]. There are plenty of approaches to learning or mining from evolving data streams; readers can refer to the comprehensive surveys [Gama et al., 2014; Gomes et al., 2017]. As for boosting and ensemble approaches, readers are referred to the books [Schapire and Freund, 2012; Zhou, 2012].
Our approach is superficially similar to DWM and AddExp: all maintain a model pool and adjust weights to penalize models with poor performance. However, we differ in the model update procedure: they do not leverage previous knowledge or reuse models to help build the new model and update the model pool. Besides, our weight update strategies also differ.
Model reuse is an important learning problem, also named model transfer, hypothesis transfer learning, or learning from auxiliary classifiers. The basic setting is that one desires to reuse pre-trained models to help further model building, especially when the data are too scarce to directly train a fair model. A series of works builds on the idea of biased regularization, which incorporates previous models as a bias regularizer in empirical risk minimization, and achieves good performance in plenty of scenarios [Duan et al., 2009; Tommasi et al., 2010, 2014]. There are also other attempts and applications, such as model reuse by random forests [Segev et al., 2017] and applying model reuse to adapt to different performance measures [Li et al., 2013]. Apart from algorithm design, theoretical foundations have recently been established via stability [Kuzborskij and Orabona, 2013], Rademacher complexity [Kuzborskij and Orabona, 2017] and transformation functions [Du et al., 2017].
Our paper proposes to handle the concept drift problem by utilizing model reuse learning. The idea of leveraging previous knowledge is reminiscent of some previous work coping with concept drift by model reuse (transfer), like the temporal inductive transfer (TIX) approach [Forman, 2006] and the diversity and transfer-based ensemble learning (DTEL) approach [Sun et al., 2018]. Both are batch-style approaches, that is, they need to receive a batch of data each time, whereas ours can update either in an incremental style or in a batch update mode. TIX concatenates the predictions of previous models onto the features of the next data batch as the new data, and a new model is learned from the augmented data batch. DTEL chooses a decision tree as the base learner, and builds a new tree by "fine-tuning" previous models via direct tree structural adaptation. It maintains a fixed-size model pool with a selection criterion based on a diversity measure. Neither depicts the reusability of previous models, which is carried out by the weight update module in our approach. Last but not least, our approach comes with sound theoretical guarantees; in particular, we carry out a generalization justification of why and how to reuse previous models, whereas theirs are not theoretically clear in general.
3 Proposed Approach
In this section, we first illustrate the basic idea, and then describe the two important modules of our proposed approach, i.e., model update and weight update.
Specifically, we adopt a drift detection algorithm to split the whole data stream into epochs in which the underlying distribution is relatively stationary, and we only perform a model update when a concept drift is detected or the maximum update period is reached. As shown in Figure 1, the drift detector monitors for concept drift. When a drift is detected, instead of resetting the model pool and training a new model from scratch, we aim to leverage the knowledge in previous models to enhance overall performance by alleviating the cold-start problem.
Basically, our approach consists of two important modules:

Model update by model reuse: we leverage previous models to build the new model and update the model pool, by means of biased-regularization multiple model reuse.

Weight update by expert advice: we associate each previous model with a weight representing its reusability on the current data. The weights are updated according to the performance of each model, in an exponentially weighted average manner.
3.1 Model Update by Model Reuse
Consider a model update as illustrated in Figure 1: we desire to leverage the previous models in the pool together with the current data epoch to obtain a new model. In this paper, we adopt a linear classifier as the base model, and model reuse by biased regularization can be formulated as
Here, the first term is the empirical loss and the second is the regularizer, traded off by a positive regularization coefficient. The bias term is the linear weighted combination of the previous models, where the weight associated with each previous model represents its reusability on the current data epoch.
For simplicity, in this paper we choose the square loss with ℓ2 regularization in the practical implementation, which essentially yields the Least Squares Support Vector Machine (LS-SVM) [Suykens et al., 2002]. It is shown [Suykens et al., 2002] that the optimal solution can be expressed as a kernel expansion, with the dual coefficients obtained from a linear system,
in which the linear kernel matrix is computed on the current epoch, and the two vectors on the right-hand side contain, respectively, the labels of the data and the (weighted) predictions of the previous models.
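Since the square-loss objective with biased regularization reduces to ridge regression toward the combined previous model, a minimal primal-form sketch is as follows (the paper works with the LS-SVM dual; the function name and `lam` are ours):

```python
import numpy as np

def model_reuse_fit(X, y, w_prior, lam):
    """Biased-regularization model reuse in the primal:
        min_w  ||X w - y||^2 + lam * ||w - w_prior||^2,
    whose normal equations give
        (X^T X + lam I) w = X^T y + lam * w_prior.
    `w_prior` plays the role of the weighted combination of previous models."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    b = X.T @ y + lam * w_prior
    return np.linalg.solve(A, b)
```

With a small `lam` the solution follows the current epoch's data; with a large `lam` it stays close to `w_prior`, which is exactly the bias toward previous knowledge.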
If concept drift occurs very frequently or the data stream accumulates for a long time, the size of the model pool will explode if there is no delete operation. Thus, we set a maximum model pool size. One could keep the models with the largest diversity, as done in [Sun et al., 2018]; for simplicity, we only keep the newest models in the pool.
The biased-regularization model reuse learning in (1) is not limited to the binary scenario, and can be easily extended to the multi-class scenario as
where the loss is defined through the multi-class margin. We defer the notation and the corresponding theoretical analyses to Appendix C. In addition, our approach is a framework and can adopt any multiple-model-reuse algorithm as the sub-routine; for instance, we could also choose model reuse by random forests [Segev et al., 2017].
3.2 Weight Update by Expert Advice
After the model update step, the weight distribution over the model pool is reinitialized; we adopt a uniform initialization over the models in the pool.
After the initialization, we update the weight of each model by expert advice [Cesa-Bianchi and Lugosi, 2006]. Specifically, when a new data item arrives, each previous model provides its prediction, and the final prediction is made from the weighted combination of the expert advice. Then the true label is revealed, and we update the weights according to the loss each model suffers, in an exponentially weighted manner,
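The exponentially weighted update just described can be sketched as follows (`eta` denotes the step size; function names are ours):

```python
import numpy as np

def update_weights(weights, losses, eta):
    """Exponentially weighted average update: down-weight models that
    suffered a large loss on the latest item, then renormalize."""
    w = weights * np.exp(-eta * losses)
    return w / w.sum()

def predict(weights, expert_preds):
    """The final prediction is the weighted combination of expert advice."""
    return float(weights @ expert_preds)
```

Models that consistently suffer small losses retain most of the weight mass, which is the mechanism behind the weight concentration observation in Section 4.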
The overall procedure of the proposed approach Condor is summarized in Algorithm 1.
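Putting the pieces together, the overall loop can be sketched as follows. This is an illustration of the control flow rather than the exact Algorithm 1: the drift test is replaced by a fixed-period stand-in for ADWIN, and the learner is the biased ridge update; all names and parameters (`K`, `P`, `eta`, `lam`) are ours.

```python
import numpy as np

def condor_sketch(stream, K=5, P=50, eta=1.0, lam=1.0):
    """Maintain a model pool (FIFO, size <= K), predict by weighted expert
    advice, update weights per item, and rebuild the model via reuse at
    each epoch boundary (stand-in for 'drift detected or period P reached')."""
    pool, weights = [], np.array([])
    epoch_X, epoch_y, preds = [], [], []
    for x, y in stream:
        if len(pool) > 0:
            advice = np.array([w @ x for w in pool])
            preds.append(float(weights @ advice))
            losses = np.clip((advice - y) ** 2, 0, 1)
            weights = weights * np.exp(-eta * losses)
            weights = weights / weights.sum()
        else:
            preds.append(0.0)  # cold start: no model yet
        epoch_X.append(x)
        epoch_y.append(y)
        if len(epoch_X) >= P:  # stand-in for the drift detector firing
            X, yv = np.array(epoch_X), np.array(epoch_y)
            prior = (weights @ np.array(pool)) if len(pool) else np.zeros(X.shape[1])
            A = X.T @ X + lam * np.eye(X.shape[1])
            new_model = np.linalg.solve(A, X.T @ yv + lam * prior)
            pool.append(new_model)
            if len(pool) > K:
                pool.pop(0)  # keep only the newest K models
            weights = np.full(len(pool), 1.0 / len(pool))  # reinitialize
            epoch_X, epoch_y = [], []
    return pool, weights, preds
```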
4 Theoretical Analysis
In this section, we provide theoretical analysis both locally and globally.
Local analysis: consider both generalization and regret aspects on each epoch locally;
Global analysis: examine regret on the whole data stream globally.
Besides, in the local analysis, we also provide the multi-class model reuse analysis, which we present in an independent subsection to better organize the results.
4.1 Local Analysis
The local analysis means that we scrutinize the performance on a particular epoch. On one hand, we are concerned with the generalization ability of the model obtained by the model update module. On the other hand, we study the quality of the weights learned by the weight update module and the cumulative regret of the predictions.
Let us consider one epoch: the model update module reuses the previous models to help build the new model, as shown in Figure 1. To simplify the presentation, we introduce some notation. Suppose the whole data stream is partitioned into epochs; we start from the second epoch, since the first has no previous models to utilize. Within each epoch, we assume the distribution is identical, i.e., the epoch is a sample of points drawn i.i.d. from a fixed distribution.
First, we conduct a generalization analysis of the model update module. Define the risk and empirical risk of a hypothesis (model) on the epoch by
Here, with a slight abuse of notation, we index the instances by the epoch they belong to. The new model is built and updated on the epoch via the model update module; then we have the following generalization error bound.
Assume that the non-negative loss function is bounded and Lipschitz continuous, and that the regularizer is a non-negative, strongly convex function w.r.t. a norm. Given the model pool, let the bias term be the linear weighted combination of the previous models, assumed to have bounded norm, and let the new model be the one returned by the model update module. Then, for any $\delta \in (0,1)$, with probability at least $1-\delta$, a generalization error bound holds in which the constants depend on the loss bound, the Lipschitz constant, and the strong convexity parameter of the regularizer.
To better present the result, we keep only the leading terms with respect to the epoch size and obtain a simplified bound whose key quantity is the risk of the reused combination of previous models on the current distribution.
Eq. (4) shows that the model update procedure enjoys an $\mathcal{O}(1/\sqrt{m})$ generalization bound under certain conditions, where $m$ is the epoch size. In particular, when the previous models are sufficiently good, that is, when the reused combination has a small risk on the current distribution, we obtain an $\mathcal{O}(1/m)$ bound, a fast-rate convergence guarantee. This implies the effectiveness of leveraging and reusing previous models on the current data, especially when the previous models are reused properly (as we illustrate in the following paragraphs).
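The shape of such hypothesis-transfer bounds can be sketched as follows (a hedged reconstruction with constants and logarithmic factors suppressed; $m$ is the epoch size and $\bar{R}$ the risk of the reused combination on the current distribution):

```latex
% Hedged sketch of the bound's shape on an epoch of m i.i.d. points:
R(\hat{h}) \;\le\; \widehat{R}(\hat{h})
  \;+\; \mathcal{O}\!\Bigl(\sqrt{\tfrac{\bar{R}}{m}}\Bigr)
  \;+\; \mathcal{O}\!\Bigl(\tfrac{1}{m}\Bigr),
% so \bar{R} \to 0 recovers the fast O(1/m) rate, while in general the
% bound is of order 1/\sqrt{m} since \bar{R} is bounded.
```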
The main techniques in the proof are inspired by Kuzborskij and Orabona, but we differ in two aspects. First, we only assume the Lipschitz condition, so their results do not directly apply under our conditions. Second, we extend the analysis of model reuse to multi-class scenarios, and include the results in Appendix C for better presentation.
Before proceeding, we introduce more notation. Let the global cumulative loss be the total loss suffered on the whole data stream. On each epoch, let the local cumulative loss be the loss suffered by our approach on that epoch, and define analogously the local cumulative loss suffered by each previous model,
Next, we show that the weight update module returns a good weight distribution, implying that our approach can reuse previous models properly. In fact, we have the following observation regarding the weight distribution.
Observation 1 (Weight Concentration).
During the weight update procedure in an epoch, the weights concentrate on those previous models that suffer a small cumulative loss on the epoch.
By a simple analysis of the weight update procedure, the weight associated with each previous model is proportional to $\exp(-\eta L_i)$, where $\eta$ is the step size and $L_i$ is the model's cumulative loss on the epoch. ∎
Though the observation seems quite straightforward, it plays an important role in making our approach successful. The statement guarantees that the algorithm adaptively assigns more weight to better-fit previous models, which essentially captures the 'reusability' of each model. We also conduct additional experiments to support this point in Appendix D.1.
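To see the concentration effect numerically, the closed-form weights $w_i \propto \exp(-\eta L_i)$ from the proof can be evaluated directly (a small illustration with made-up cumulative losses; the function name is ours):

```python
import numpy as np

def concentrated_weights(cum_losses, eta):
    """Closed form of the weights after an epoch: w_i ∝ exp(-eta * L_i),
    where L_i is model i's cumulative loss on the epoch."""
    w = np.exp(-eta * np.asarray(cum_losses, dtype=float))
    return w / w.sum()
```

For example, with cumulative losses `[0, 5, 10]` and `eta = 1`, the best-fit model receives over 99% of the weight mass, illustrating how quickly poorly fitting models are suppressed.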
Third, we show that our approach can benefit from recurring concept drift scenarios. Here, we adopt the concept of cumulative regret (or regret) from online learning [Zinkevich, 2003] as the performance measurement.
Theorem 2 (Improved Local Regret [Cesa-Bianchi and Lugosi, 2006]).
Assume that the loss function is convex in its first argument and takes values in $[0,1]$. Besides, the step size is tuned using $L^*$, the cumulative loss of the best-fit previous model, which is supposed to be known in advance. Then, we have
Refer to the proof presented on page 21, in Chapter 2.4 of Cesa-Bianchi and Lugosi. ∎
The above statement shows that the order of the regret bound can be substantially improved from the typical square-root dependence on the epoch length to a bound independent of the number of data items in the epoch, provided that $L^*$, the cumulative loss of the best-fit previous model, is small.
Theorem 2 implies that if the concept of the epoch, or a similar concept, has emerged previously, our approach enjoys a substantially improved local regret provided a proper step size is chosen. This accords with our intuition on why model reuse helps on a concept drift data stream. In many situations, although the underlying distribution might change over time, concepts can be recurring, i.e., disappear and re-appear [Katakis et al., 2010; Gama and Kosina, 2014]. Thus, the statement shows that our approach can benefit from such recurring concepts, and we empirically support this point in Appendix D.2.
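For reference, the small-loss bound from Cesa-Bianchi and Lugosi (Section 2.4) has the following shape; constants here follow the textbook and may differ slightly from the paper's exact statement:

```latex
% With N experts, \hat{L}_T the forecaster's cumulative loss and
% L^* = \min_i L_{i,T} that of the best expert, tuning the step size
% \eta = \ln\bigl(1 + \sqrt{2 \ln N / L^*}\bigr) yields
\hat{L}_T - L^{*} \;\le\; \sqrt{2 L^{*} \ln N} \;+\; \ln N ,
% which does not depend on T and is of order \ln N whenever L^* is small.
```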
4.2 Global Analysis
The global analysis means that we study the overall performance on the whole data stream. We provide the global dynamic regret as follows.
Theorem 3 (Global Dynamic Regret).
Assume that the loss function is convex in its first argument and takes values in $[0,1]$. Assume that the step size in each epoch is set according to the epoch length (this choice requires knowledge of the epoch size, which can be eliminated by the doubling trick at the price of a small constant factor [Cesa-Bianchi et al., 1997]); then we have
The proof of the global dynamic regret is built on the local static regret analysis in each epoch. We can see that, for a data stream of fixed length, the more concept drifts occur (i.e., the more epochs), the larger the regret will be. This accords with our intuition: on one hand, the sum of the best-fit local cumulative losses shrinks as more previous models become available to compare against; on the other hand, the learning problem becomes definitely harder as concept drift occurs more frequently.
Our strategy is essentially the exponentially weighted average forecaster [Cesa-Bianchi and Lugosi, 2006], and thus we have the following local regret guarantee in each epoch.
Lemma 1 (Theorem 2.2 in Cesa-Bianchi and Lugosi ).
Assume that the loss function is convex in its first argument and takes values in $[0,1]$. Assume that the step size is set according to the number of experts and the epoch length; then we have
The proof is based on a simple reduction from our scenario to the standard exponentially weighted average forecaster. For each epoch, let the pool of previous models serve as the expert pool. Then, plugging the number of experts and the number of instances into Theorem 2.2 of Cesa-Bianchi and Lugosi, we obtain the statement. ∎
Now, we proceed to prove Theorem 3.
Our proof relies on applying the local static regret analysis. Since the epochs form a partition of the whole period, we apply Lemma 1 to each epoch locally and obtain
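Under standard notation ($K$ epochs $S_1, \dots, S_K$ of total length $T$, at most $K$ experts per epoch), the summation step can be sketched as:

```latex
% Summing the local bound of Lemma 1 over the K epochs and applying
% the Cauchy–Schwarz inequality, with T = \sum_k |S_k|:
\sum_{k=1}^{K}\Bigl(\hat{L}_{S_k} - \min_{i} L_{i,S_k}\Bigr)
  \;\le\; \sum_{k=1}^{K}\sqrt{\tfrac{|S_k|}{2}\ln K}
  \;\le\; \sqrt{\tfrac{K T}{2}\ln K}.
```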
Essentially, the regret bound in (6) is different from the traditional (static) regret bound. It measures the difference between the global cumulative loss and the sum of the local cumulative losses suffered by the best-fit previous models. Namely, our competitor changes in each epoch, which captures the distribution change in the sequence, and is thus a more suitable performance measure in non-stationary environments.
4.3 Multi-Class Model Reuse Learning
In multi-class learning scenarios, the notations are slightly different from those in the binary case. We first introduce the new notation for a clear presentation.
Let $\mathcal{X}$ denote the input feature space and $\mathcal{Y}$ the target label space. Our analysis acts on the last data epoch, a sample of points drawn i.i.d. according to the current distribution, where each instance carries a single class label from $\mathcal{Y}$. Given the multi-class hypothesis set, any hypothesis $h$ maps $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}$, and makes the prediction $\hat{y} = \arg\max_{y \in \mathcal{Y}} h(x, y)$. This naturally raises the definition of the margin of the hypothesis at a labeled instance $(x, y)$,
$$\rho_h(x, y) = h(x, y) - \max_{y' \neq y} h(x, y').$$
The non-negative loss function $\Phi$ is assumed to be bounded. Besides, we assume $\Phi$ is a regular loss as defined in Lei et al.
Definition 1 (Regular Loss).
We call a loss function $\Phi$ regular if it satisfies the following properties (cf. Definition 2 in Lei et al.):
(a) $\Phi$ bounds the 0-1 loss from above;
(b) $\Phi$ is Lipschitz continuous;
(c) $\Phi$ is decreasing and has a zero point, i.e., there exists a point at which $\Phi$ vanishes.
Then the risk and empirical risk of a hypothesis on the epoch are defined by
Our goal is to provide a generalization analysis, namely, to prove that the risk approaches the empirical risk as the number of instances increases, and to establish the convergence rate. Since the 0-1 loss is discontinuous, we cannot directly utilize concentration inequalities in the analysis. To proceed, we introduce the risk w.r.t. the surrogate loss $\Phi$,
From the properties in Definition 1, we know that the 0-1 risk is a lower bound on the surrogate risk, that is, $R_{0/1}(h) \le R_{\Phi}(h)$. Thus, we only need to establish a generalization bound between $R_{\Phi}$ and its empirical counterpart. Since $\Phi$ is bounded, we can utilize concentration inequalities again.
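The reduction can be written compactly; here $\rho_h$ denotes the multi-class margin and $\Phi$ the regular surrogate loss, using the property that $\Phi$ upper-bounds the 0-1 loss:

```latex
% Pointwise, \mathbf{1}[\arg\max_{y'} h(x,y') \ne y] \le \Phi(\rho_h(x,y)),
% so taking expectations over the epoch's distribution \mathcal{D}:
R_{0/1}(h) \;\le\; R_{\Phi}(h)
  \;=\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\bigl[\Phi(\rho_h(x,y))\bigr],
% and boundedness 0 \le \Phi \le M lets standard concentration
% inequalities apply to R_\Phi and its empirical counterpart.
```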
First, we identify the optimization formulation of multi-class biased regularization model reuse,
We specify the regularizer as the squared Frobenius norm, and provide the following generalization error bound.
Let the hypothesis set be norm-bounded, and assume that the non-negative loss function is regular. Given the model pool, let the bias term be the linear weighted combination of the previous models, assumed to have bounded norm, and let the new model be the one returned by the model update module. Then, for any $\delta \in (0,1)$, with probability at least $1-\delta$, a generalization error bound holds in which the constants depend on the Lipschitz constant of the loss, its bound, and the number of classes.
To better present the result, we keep only the leading terms with respect to the epoch size and obtain a simplified bound whose key quantity is the risk of reusing the combined previous models on the current distribution.
From Theorem 4, we can see that the main result and conclusion in the multi-class case are very similar to those in the binary case: Condor enjoys an $\mathcal{O}(1/\sqrt{m})$ generalization bound, consistent with common learning guarantees. More importantly, Condor enjoys an $\mathcal{O}(1/m)$ fast-rate guarantee when the risk of reusing previous models on the current distribution approaches zero, namely, when the previous models are highly 'reusable' with respect to the current data. This shows the effectiveness of leveraging and reusing previous models to help build the new model in the multi-class scenario.
5 Experiments

In this section, we examine the effectiveness of Condor on both synthetic and real-world concept drift datasets. Additional experiments regarding weight concentration, recurring concept drift, parameter study and robustness comparisons are presented in Appendix D.
Compared Approaches. We compare with two classes of state-of-the-art concept drift approaches. The first is the ensemble category, including (a) Learn++.NSE [Elwell and Polikar, 2011], (b) DWM [Kolter and Maloof, 2003; Kolter and Maloof, 2007] and (c) AddExp [Kolter and Maloof, 2005]. The second is the model-reuse category, including (d) DTEL [Sun et al., 2018] and (e) TIX [Forman, 2006]. Essentially, DTEL and TIX also adopt the ensemble idea; we place them in the model-reuse category to highlight their model reuse strategies.
Settings. In our experiments, we choose the ADWIN algorithm [Bifet and Gavaldà, 2007] as the drift detector, with the default parameter setting reported in its paper and source code. Besides, for all approaches we fix the maximum update period (epoch size) and the model pool size. (For the Covertype and GasSensor datasets we use a different epoch size, since Covertype is extremely large, with 581,012 data items in total, and GasSensor is a multi-class dataset with higher sample complexity.)
[Table 1: statistics of the datasets (# instance, # dim, # class).]
Synthetic Datasets. Since it is not realistic to know in advance the detailed concept drift information of real-world datasets (such as when a change starts and ends), we employ six widely used synthetic datasets (SEA, CIR, SIN and STA, with corresponding variants) in the experiments. Besides, another six synthetic datasets for binary classification are also adopted: 1CDT, 1CHT, UG-2C-2D, UG-2C-3D, UG-2C-5D, and GEARS-2C-2D. Brief statistics are summarized in Table 1, and detailed dataset information is provided in Appendix E.
We plot the holdout accuracy comparisons on three synthetic datasets, SEA200A, SEA200G and SEA500G. Since some of the compared approaches are batch-style, following the splitting setting in Sun et al., we split the streams into 120 epochs for a clear presentation. The holdout accuracy is calculated on testing data generated from the identical distribution as the training data at each time stamp. For SEA and its variants, the distribution changes seven times. From Figure 2, we can see that all approaches drop when an abrupt concept drift occurs. Nevertheless, our approach Condor is relatively stable and recovers rapidly as more data items arrive, achieving the highest accuracy among all the compared approaches, which validates its effectiveness.
Table 2: Mean and standard deviation of predictive accuracy (%) over 10 trials.

| Dataset | Learn++.NSE | DWM | AddExp | DTEL | TIX | Condor (ours) |
|---|---|---|---|---|---|---|
| SEA200A | 84.48 ± 0.19 | 86.07 ± 0.30 | 84.35 ± 0.86 | 80.50 ± 0.58 | 82.79 ± 0.27 | 86.67 ± 0.21 |
| SEA200G | 85.48 ± 0.33 | 86.92 ± 0.13 | 85.54 ± 0.69 | 80.73 ± 0.19 | 82.95 ± 0.12 | 87.63 ± 0.24 |
| SEA500G | 86.03 ± 0.19 | 87.63 ± 0.06 | 87.14 ± 0.12 | 80.42 ± 0.24 | 83.26 ± 0.07 | 88.21 ± 0.04 |
| CIR500G | 84.77 ± 0.56 | 77.09 ± 0.71 | 76.48 ± 0.81 | 79.03 ± 0.34 | 66.38 ± 0.85 | 68.41 ± 0.87 |
| SIN500G | 79.41 ± 0.07 | 66.99 ± 0.10 | 66.81 ± 0.12 | 74.93 ± 0.34 | 62.73 ± 0.14 | 65.68 ± 0.12 |
| STA500G | 83.97 ± 0.13 | 87.43 ± 0.18 | 86.89 ± 0.27 | 88.26 ± 0.18 | 85.95 ± 0.07 | 88.60 ± 0.07 |
| 1CDT | 99.77 ± 0.14 | 99.92 ± 0.10 | 99.92 ± 0.10 | 99.69 ± 0.11 | 99.56 ± 0.08 | 99.95 ± 0.04 |
| 1CHT | 99.69 ± 0.20 | 99.71 ± 0.28 | 99.56 ± 0.46 | 92.08 ± 0.22 | 99.41 ± 0.22 | 99.86 ± 0.13 |
| UG-2C-2D | 94.42 ± 0.12 | 95.60 ± 0.12 | 94.36 ± 0.78 | 93.98 ± 0.13 | 94.69 ± 0.13 | 95.27 ± 0.09 |
| UG-2C-3D | 93.82 ± 0.60 | 95.23 ± 0.59 | 94.61 ± 0.73 | 92.94 ± 0.72 | 94.31 ± 0.69 | 94.84 ± 0.59 |
| UG-2C-5D | 90.30 ± 0.30 | 92.85 ± 0.23 | 92.20 ± 0.23 | 88.21 ± 0.35 | 89.84 ± 0.38 | 91.83 ± 0.24 |
| GEARS-2C-2D | 95.82 ± 0.02 | 95.83 ± 0.02 | 95.83 ± 0.02 | 94.96 ± 0.03 | 95.03 ± 0.02 | 95.91 ± 0.01 |
| Usenet-1 | 63.76 ± 2.01 | 67.26 ± 3.11 | 62.11 ± 2.67 | 68.02 ± 1.19 | 65.03 ± 1.70 | 73.13 ± 1.12 |
| Usenet-2 | 72.42 ± 1.14 | 68.41 ± 1.17 | 70.55 ± 2.41 | 72.02 ± 0.87 | 70.56 ± 1.15 | 75.13 ± 1.06 |
| Luxembourg | 98.64 ± 0.00 | 90.42 ± 0.55 | 90.77 ± 0.52 | 100.0 ± 0.00 | 90.99 ± 0.97 | 99.98 ± 0.03 |
| Spam | 90.79 ± 0.85 | 92.18 ± 0.34 | 91.78 ± 0.33 | 85.53 ± 1.22 | 87.10 ± 1.45 | 95.22 ± 0.48 |
| Email | 74.21 ± 4.61 | 72.58 ± 4.10 | 60.78 ± 6.12 | 83.36 ± 1.87 | 79.83 ± 3.73 | 91.60 ± 1.86 |
| Weather | 75.99 ± 0.36 | 70.83 ± 0.49 | 70.07 ± 0.34 | 68.92 ± 0.27 | 70.21 ± 0.33 | 79.37 ± 0.26 |
| GasSensor | 42.36 ± 3.72 | 76.61 ± 0.36 | 76.61 ± 0.36 | 63.82 ± 3.64 | 43.40 ± 2.88 | 81.57 ± 3.77 |
| Powersupply | 74.06 ± 0.28 | 72.09 ± 0.29 | 72.13 ± 0.23 | 69.90 ± 0.38 | 68.34 ± 0.16 | 72.82 ± 0.29 |
| Electricity | 78.97 ± 0.18 | 78.03 ± 0.17 | 75.62 ± 0.42 | 81.05 ± 0.35 | 58.44 ± 0.71 | 84.73 ± 0.33 |
| Covertype | 79.08 ± 1.30 | 74.17 ± 0.87 | 73.13 ± 1.53 | 69.43 ± 1.30 | 64.60 ± 0.89 | 89.58 ± 0.14 |
| Condor w/t/l | 18/1/3 | 14/3/4 | 17/2/3 | 19/1/2 | 22/0/0 | rank first 16/22 |
Real-world Datasets. We adopt 10 real-world datasets: Usenet-1, Usenet-2, Luxembourg, Spam, Email, Weather, GasSensor, Powersupply, Electricity and Covertype. The number of data items varies from 1,500 to 581,012, and the class number varies from 2 to 6. Detailed descriptions are provided in Appendix E.2. We conduct all experiments for 10 trials and report the overall mean and standard deviation of predictive accuracy in Table 2; the synthetic datasets are also included. As we can see, Condor has a significant advantage over the other approaches: it achieves the best accuracy on 16 of the 22 datasets in total, and ranks second on four others. The reason Condor behaves poorly on CIR500G and SIN500G is that these two datasets are highly nonlinear (generated by a circle and a sine function, respectively). It is also noteworthy that Condor performs significantly better than the other approaches on the real-world datasets. These results show the superiority of our proposed approach.
6 Conclusion

In this paper, a novel and effective approach, Condor, is proposed for handling concept drift via model reuse. It consists of two key components: model update and weight update. Our approach is built on a drift detector: when a drift is detected or the maximum accumulation number is reached, the model update module leverages and reuses previous models in a weighted manner, while the weight update module adaptively adjusts the weights of previous models according to their performance. Through generalization analysis, we prove that the model reuse strategy helps when previous models are reused properly. Through regret analysis, we show that the weights concentrate on the better-fit models, and that the approach achieves a sound dynamic cumulative regret. Empirical results show the superiority of our approach over the compared methods, on both synthetic and real-world datasets.
In the future, it would be interesting to incorporate more techniques from model reuse learning into handling concept drift problems.
- Bartlett and Mendelson  Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
- Beygelzimer et al.  Alina Beygelzimer, Satyen Kale, and Haipeng Luo. Optimal and adaptive algorithms for online boosting. In International Conference on Machine Learning, ICML, pages 2323–2331, 2015.
- Bifet and Gavaldà  Albert Bifet and Ricard Gavaldà. Learning from time-changing data with adaptive windowing. In Proceedings of the Seventh SIAM International Conference on Data Mining, pages 443–448, 2007.
- Bousquet et al.  Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. In Advanced Lectures on Machine Learning, ML Summer Schools 2003, pages 169–207, 2003.
- Bousquet  Olivier Bousquet. Concentration inequalities and empirical processes theory applied to the analysis of learning algorithms. PhD thesis, Ecole Polytechnique, 2002.
- Cattral et al.  Robert Cattral, Franz Oppacher, and Dwight Deugo. Evolutionary data mining with automatic rule generalization. 2002.
- Cesa-Bianchi and Lugosi  Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
- Cesa-Bianchi et al.  Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, 1997.
- Chen et al.  Yanping Chen, Eamonn Keogh, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, and Gustavo Batista. The UCR time series classification archive. 2015.
- Crammer et al.  Koby Crammer, Yishay Mansour, Eyal Even-Dar, and Jennifer Wortman Vaughan. Regret minimization with concept drift. In Annual Conference on Computational Learning Theory, COLT, pages 168–180, 2010.
- de Souza et al.  Vinícius M. A. de Souza, Diego Furtado Silva, João Gama, and Gustavo E. A. P. A. Batista. Data stream classification guided by clustering on nonstationary environments and extreme verification latency. In SIAM International Conference on Data Mining, SDM, pages 873–881, 2015.
- Du et al.  Simon S. Du, Jayanth Koushik, Aarti Singh, and Barnabás Póczos. Hypothesis transfer learning via transformation functions. In Advances in Neural Information Processing Systems, NIPS, pages 574–584, 2017.
- Duan et al.  Lixin Duan, Ivor W. Tsang, Dong Xu, and Tat-Seng Chua. Domain adaptation from multiple sources via auxiliary classifiers. In International Conference on Machine Learning, ICML, pages 289–296, 2009.
- Elwell and Polikar  Ryan Elwell and Robi Polikar. Incremental learning of concept drift in nonstationary environments. IEEE Transactions on Neural Networks, 22(10):1517–1531, 2011.
- Forman  George Forman. Tackling concept drift by temporal inductive transfer. In International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR, pages 252–259, 2006.
- Gama and Kosina  João Gama and Petr Kosina. Recurrent concepts in data streams classification. Knowledge and Information Systems, 40(3):489–507, 2014.
- Gama et al.  João Gama, Ricardo Rocha, and Pedro Medas. Accurate decision trees for mining high-speed data streams. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD, pages 523–528, 2003.
- Gama et al.  João Gama, Indre Zliobaite, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys, 46(4):44:1–44:37, 2014.
- Gomes et al.  Heitor Murilo Gomes, Jean Paul Barddal, Fabrício Enembreck, and Albert Bifet. A survey on ensemble learning for data stream classification. ACM Computing Surveys, 50(2):23:1–23:36, 2017.
- Harel et al.  Maayan Harel, Shie Mannor, Ran El-Yaniv, and Koby Crammer. Concept drift detection through resampling. In International Conference on Machine Learning, ICML, pages 1009–1017, 2014.
- Harries and Wales  Michael Harries and New South Wales. Splice-2 comparative evaluation: Electricity pricing. Technical Report of South Wales University, 1999.
- Helmbold and Long  David P. Helmbold and Philip M. Long. Tracking drifting concepts by minimizing disagreements. Machine Learning, 14(1):27–45, 1994.
- Jaber et al.  Ghazal Jaber, Antoine Cornuéjols, and Philippe Tarroux. A new on-line learning method for coping with recurring concepts: The ADACC system. In International Conference on Neural Information Processing, ICONIP, pages 595–604, 2013.
- Kakade et al.  Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13:1865–1890, 2012.
- Katakis et al.  Ioannis Katakis, Grigorios Tsoumakas, and Ioannis P. Vlahavas. An ensemble of classifiers for coping with recurring contexts in data streams. In European Conference on Artificial Intelligence, ECAI, pages 763–764, 2008.
- Katakis et al.  Ioannis Katakis, Grigorios Tsoumakas, Evangelos Banos, Nick Bassiliades, and Ioannis P. Vlahavas. An adaptive personalized news dissemination system. Journal of Intelligent Information Systems, 32(2):191–212, 2009.
- Katakis et al.  Ioannis Katakis, Grigorios Tsoumakas, and Ioannis P. Vlahavas. Tracking recurring contexts using ensemble classifiers: an application to email filtering. Knowledge and Information Systems, 22(3):371–391, 2010.
- Klinkenberg and Joachims  Ralf Klinkenberg and Thorsten Joachims. Detecting concept drift with support vector machines. In International Conference on Machine Learning, ICML, pages 487–494, 2000.
- Klinkenberg  Ralf Klinkenberg. Learning drifting concepts: Example selection vs. example weighting. Intelligent Data Analysis, 8(3):281–300, 2004.
- Kolter and Maloof  Jeremy Z. Kolter and Marcus A. Maloof. Dynamic weighted majority: A new ensemble method for tracking concept drift. In IEEE International Conference on Data Mining, ICDM, pages 123–130, 2003.
- Kolter and Maloof  Jeremy Z. Kolter and Marcus A. Maloof. Using additive expert ensembles to cope with concept drift. In International Conference on Machine Learning, ICML, pages 449–456, 2005.
- Kolter and Maloof  J. Zico Kolter and Marcus A. Maloof. Dynamic weighted majority: An ensemble method for drifting concepts. Journal of Machine Learning Research, 8:2755–2790, 2007.
- Koychev  Ivan Koychev. Gradual forgetting for adaptation to concept drift. In Proceedings of the ECAI 2000 Workshop on Current Issues in Spatio-Temporal Reasoning, pages 101–106, 2000.
- Kuncheva and Zliobaite  Ludmila I. Kuncheva and Indre Zliobaite. On the window size for classification in changing environments. Intelligent Data Analysis, 13(6):861–872, 2009.
- Kuzborskij and Orabona  Ilja Kuzborskij and Francesco Orabona. Stability and hypothesis transfer learning. In International Conference on Machine Learning, ICML, pages 942–950, 2013.
- Kuzborskij and Orabona  Ilja Kuzborskij and Francesco Orabona. Fast rates by transferring from auxiliary hypotheses. Machine Learning, 106(2):171–195, 2017.
- Ledoux and Talagrand  Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013.
- Lei et al.  Yunwen Lei, Ürün Dogan, Alexander Binder, and Marius Kloft. Multi-class svms: From tighter data-dependent generalization bounds to novel algorithms. In Advances in Neural Information Processing Systems, NIPS, pages 2035–2043, 2015.
- Li et al.  Nan Li, Ivor W. Tsang, and Zhi-Hua Zhou. Efficient optimization of performance measures by classifier adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(6):1370–1382, 2013.
- Maurer  Andreas Maurer. A vector-contraction inequality for rademacher complexities. In International Conference on Algorithmic Learning Theory, ALT, pages 3–17, 2016.
- Mohri and Medina  Mehryar Mohri and Andres Muñoz Medina. New analysis and algorithm for learning with drifting distributions. In International Conference on Algorithmic Learning Theory, ALT, pages 124–138, 2012.
- Mohri et al.  Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2012.
- Schapire and Freund  Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012.
- Schapire  Robert E. Schapire. The strength of weak learnability. Machine Learning, 5:197–227, 1990.
- Schlimmer and Granger  Jeffrey C. Schlimmer and Richard H. Granger. Incremental learning from noisy data. Machine Learning, 1(3):317–354, 1986.
- Schölkopf et al.  Bernhard Schölkopf, Ralf Herbrich, and Alexander J. Smola. A generalized representer theorem. In Annual Conference Computational Learning Theory, COLT, pages 416–426, 2001.
- Segev et al.  Noam Segev, Maayan Harel, Shie Mannor, Koby Crammer, and Ran El-Yaniv. Learn on source, refine on target: A model transfer learning framework with random forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1811–1824, 2017.
- Srebro et al.  Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, NIPS, pages 2199–2207. 2010.
- Street and Kim  W. Nick Street and YongSeog Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD, pages 377–382, 2001.
- Sun et al.  Yu Sun, Ke Tang, Zexuan Zhu, and Xin Yao. Concept drift adaptation by exploiting historical knowledge. IEEE Transactions on Neural Networks and Learning Systems, To appear, 2018.
- Suykens et al.  Johan AK Suykens, Tony Van Gestel, and Jos De Brabanter. Least squares support vector machines. World Scientific, 2002.
- Tommasi et al.  Tatiana Tommasi, Francesco Orabona, and Barbara Caputo. Safety in numbers: Learning categories from few examples with multi model knowledge transfer. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pages 3081–3088, 2010.
- Tommasi et al.  Tatiana Tommasi, Francesco Orabona, and Barbara Caputo. Learning categories from few examples with multi model knowledge transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5):928–941, 2014.
- Vergara et al.  Alexander Vergara, Shankar Vembu, Tuba Ayhan, Margaret A Ryan, Margie L Homer, and Ramón Huerta. Chemical gas sensor drift compensation using classifier ensembles. Sensors and Actuators B: Chemical, 166:320–329, 2012.
- Vlachos et al.  Michail Vlachos, Carlotta Domeniconi, Dimitrios Gunopulos, George Kollios, and Nick Koudas. Non-linear dimensionality reduction techniques for classification and visualization. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD, pages 645–651, 2002.
- Zhou  Zhi-Hua Zhou. Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC Press, 2012.
- Zinkevich  Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, ICML, pages 928–936, 2003.
- Zliobaite  Indre Zliobaite. Combining similarity in time and space for training set formation under concept drift. Intelligent Data Analysis, 15(4):589–611, 2011.
Appendix A Prerequisite Knowledge and Technical Lemmas
In this section, we introduce the prerequisite knowledge and technical lemmas used in proving the main results. Specifically, we utilize Rademacher complexity [Bartlett and Mendelson, 2002] to prove the generalization bounds. Besides, we exploit several function properties when bounding the Rademacher complexity and proving the regret bounds.
A.1 Rademacher Complexity
To simplify the presentation, we first introduce some notation. Let $S = \{z_1, \ldots, z_m\}$ be a sample of $m$ points drawn i.i.d. according to distribution $\mathcal{D}$. Then the risk and empirical risk of a hypothesis $h$ are defined by
\[ R(h) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(h, z)], \qquad \widehat{R}_S(h) = \frac{1}{m} \sum_{i=1}^{m} \ell(h, z_i), \]
where $\ell$ denotes the loss function.
In the following, we will utilize the notion of Rademacher complexity [Bartlett and Mendelson, 2002] to measure the hypothesis complexity and use it to bound the generalization error.
(Rademacher Complexity [Bartlett and Mendelson, 2002]) Let $\mathcal{F}$ be a family of functions and $S = \{z_1, \ldots, z_m\}$ a fixed sample of size $m$. Then, the empirical Rademacher complexity of $\mathcal{F}$ with respect to the sample $S$ is defined as
\[ \widehat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_{\boldsymbol{\sigma}} \left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i f(z_i) \right], \]
where $\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_m)$ are independent Rademacher variables, i.e., uniformly distributed over $\{-1, +1\}$.
Besides, the Rademacher complexity of $\mathcal{F}$ is the expectation of the empirical Rademacher complexity over all samples of size $m$ drawn according to $\mathcal{D}$:
\[ \mathfrak{R}_m(\mathcal{F}) = \mathbb{E}_{S \sim \mathcal{D}^m} \big[ \widehat{\mathfrak{R}}_S(\mathcal{F}) \big]. \]
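To make the definition concrete, the following sketch computes the empirical Rademacher complexity exactly for a small finite function class by enumerating all $2^m$ sign vectors. The sample `zs` and the threshold class `fs` are invented purely for illustration and are not part of the analysis.

```python
import itertools

def empirical_rademacher(fs, zs):
    """Exact empirical Rademacher complexity of a finite function class
    `fs` on a fixed sample `zs`, enumerating all 2^m sign vectors."""
    m = len(zs)
    total = 0.0
    for sigma in itertools.product([-1, 1], repeat=m):
        # supremum over the (finite) class of the signed empirical average
        total += max(sum(s * f(z) for s, z in zip(sigma, zs)) / m for f in fs)
    return total / 2 ** m

# A hypothetical sample and a tiny class of threshold functions.
zs = [0.1, 0.4, 0.6, 0.9]
fs = [lambda z, t=t: 1.0 if z >= t else -1.0 for t in (0.25, 0.5, 0.75)]
print(empirical_rademacher(fs, zs))  # nonnegative by definition
```

A singleton class has empirical Rademacher complexity exactly zero, since the Rademacher variables have zero mean; this is a useful sanity check for the implementation.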
A.2 Function Properties
In the following, we introduce several common and useful function properties.
Definition 3 (Lipschitz Continuity).
A function $f$ is $L$-Lipschitz continuous w.r.t. a norm $\|\cdot\|$ over domain $\mathcal{X}$ if for all $x, y \in \mathcal{X}$, we have
\[ |f(x) - f(y)| \le L \|x - y\|. \]
Definition 4 (Strong Convexity).
A function $f$ is $\lambda$-strongly convex w.r.t. a norm $\|\cdot\|$ if for all $x, y$ and for any $\alpha \in [0, 1]$, we have
\[ f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y) - \frac{\lambda}{2} \alpha (1 - \alpha) \|x - y\|^2. \]
A common and equivalent form for the differentiable case is
\[ f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\lambda}{2} \|y - x\|^2. \]
Definition 5 (Smoothness).
A function $f$ is $H$-smooth w.r.t. a norm $\|\cdot\|$ if for all $x, y$, we have
\[ f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{H}{2} \|y - x\|^2. \]
If $f$ is differentiable, the above condition is equivalent to a Lipschitz condition over the gradients,
\[ \|\nabla f(x) - \nabla f(y)\|_* \le H \|x - y\|, \]
where $\|\cdot\|_*$ denotes the dual norm of $\|\cdot\|$.
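As a minimal numerical illustration (not part of the analysis), the quadratic $f(x) = \frac{1}{2}x^2$ is $1$-strongly convex and $1$-smooth w.r.t. the absolute value, with gradient $f'(x) = x$; the sketch below spot-checks all of the inequalities above at random points.

```python
import random

# f(x) = 0.5 * x^2 is 1-strongly convex and 1-smooth w.r.t. |.|,
# with gradient f'(x) = x; we spot-check the defining inequalities.
f = lambda x: 0.5 * x * x
grad = lambda x: x

random.seed(0)
for _ in range(1000):
    x, y = random.uniform(-5, 5), random.uniform(-5, 5)
    a = random.random()
    # strong convexity (lambda = 1), interpolation form
    assert (f(a * x + (1 - a) * y)
            <= a * f(x) + (1 - a) * f(y)
            - 0.5 * a * (1 - a) * (x - y) ** 2 + 1e-9)
    # strong convexity, differentiable form (quadratic lower bound)
    assert f(y) >= f(x) + grad(x) * (y - x) + 0.5 * (y - x) ** 2 - 1e-9
    # smoothness (H = 1): quadratic upper bound and gradient Lipschitzness
    assert f(y) <= f(x) + grad(x) * (y - x) + 0.5 * (y - x) ** 2 + 1e-9
    assert abs(grad(x) - grad(y)) <= abs(x - y) + 1e-9
print("all checks passed")
```

For this particular quadratic, every inequality in fact holds with equality, which makes it a convenient boundary case for testing.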
A.3 Technical Lemmas
To obtain a fast generalization rate, we essentially need a Bernstein-type concentration inequality. We adopt the functional generalization of Bennett's inequality due to Bousquet [Bousquet, 2002]; for self-containedness, we state the conclusion in Lemma 2 as follows.
Lemma 2 (Theorem 2.11 in Bousquet [2002]).
Assume the $X_i$ are identically distributed according to $P$. Let $\mathcal{F}$ be a countable set of functions from $\mathcal{X}$ to $\mathbb{R}$ and assume that all functions $f$ in $\mathcal{F}$ are $P$-measurable, square-integrable and satisfy $\mathbb{E}[f] = 0$. If $\sup_{f \in \mathcal{F}} \operatorname{ess\,sup} f \le 1$, then we denote
\[ Z = \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} f(X_i), \]
and if $-f \in \mathcal{F}$ whenever $f \in \mathcal{F}$, $Z$ can be defined as above or as
\[ Z = \sup_{f \in \mathcal{F}} \Big| \sum_{i=1}^{n} f(X_i) \Big|. \]
Let $\sigma$ be a positive real number such that $\sigma^2 \ge \sup_{f \in \mathcal{F}} \operatorname{Var}[f(X_1)]$ almost surely. Then for all $t \ge 0$, we have
\[ \Pr\big[ Z \ge \mathbb{E}[Z] + t \big] \le \exp\left( - v\, h\!\left( \tfrac{t}{v} \right) \right), \]
with $v = n \sigma^2 + 2 \mathbb{E}[Z]$ and $h(x) = (1 + x) \log(1 + x) - x$; also
\[ \Pr\Big[ Z \ge \mathbb{E}[Z] + \sqrt{2 v t} + \tfrac{t}{3} \Big] \le e^{-t}. \]
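A small, purely illustrative sanity check: the Bernstein-type tail follows from the Bennett-type one via the elementary inequality $h(x) \ge x^2 / (2 + 2x/3)$ for $x \ge 0$, which can be verified numerically on a grid.

```python
import math

# h is the function appearing in Bennett-type concentration bounds.
h = lambda x: (1 + x) * math.log(1 + x) - x

# Bennett-to-Bernstein relaxation: h(x) >= x^2 / (2 + 2x/3) for x >= 0,
# which turns exp(-v*h(t/v)) into the familiar exp(-t^2/(2v + 2t/3)) tail.
for i in range(1, 2001):
    x = i / 100.0  # grid over (0, 20]
    assert h(x) >= x * x / (2 + 2 * x / 3)
print("inequality holds on the grid")
```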
Besides, for a strongly convex regularizer, we have the following property, which will be useful in proving Theorem 1.
Lemma 3 (Corollary 4 in Kakade et al. [2012]).
If $F$ is $\beta$-strongly convex w.r.t. a norm $\|\cdot\|$ and $F^*(\mathbf{0}) = 0$, then, denoting the partial sum $\sum_{i \le t} v_i$ by $v_{1:t}$, we have, for any sequence $v_1, \ldots, v_n$ and for any $u$,
\[ \sum_{t=1}^{n} \langle v_t, u \rangle - F(u) \le F^*(v_{1:n}) \le \sum_{t=1}^{n} \langle \nabla F^*(v_{1:t-1}), v_t \rangle + \frac{1}{2\beta} \sum_{t=1}^{n} \|v_t\|_*^2, \]
where $F^*$ is the Fenchel conjugate of $F$ and $\|\cdot\|_*$ the dual norm of $\|\cdot\|$.
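To see the lemma in action, one can take $F(w) = \frac{1}{2}\|w\|_2^2$, which is $1$-strongly convex with $F^* = F$ and $\nabla F^*(v) = v$. The following sketch, with random vectors invented for illustration, spot-checks the chain of inequalities in this special case.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# F(w) = 0.5 * ||w||^2: 1-strongly convex, F* = F, grad F*(v) = v.
random.seed(1)
d, n = 3, 10
vs = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
u = [random.uniform(-2, 2) for _ in range(d)]

prefix = [0.0] * d   # running partial sum v_{1:t-1}
lhs = 0.0            # sum_t <v_t, u> - F(u)
rhs = 0.0            # sum_t <grad F*(v_{1:t-1}), v_t> + (1/2) sum_t ||v_t||^2
for v in vs:
    lhs += dot(v, u)
    rhs += dot(prefix, v) + 0.5 * dot(v, v)
    prefix = [p + a for p, a in zip(prefix, v)]
lhs -= 0.5 * dot(u, u)               # subtract F(u)
Fstar = 0.5 * dot(prefix, prefix)    # F*(v_{1:n})
assert lhs <= Fstar + 1e-9 <= rhs + 2e-9
print("lemma holds for this instance")
```

For this quadratic regularizer the right-hand inequality is in fact an identity, since expanding $\frac{1}{2}\|v_{1:t}\|^2 - \frac{1}{2}\|v_{1:t-1}\|^2$ and summing telescopes exactly.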
Lemma 4 (Lemma 8.1 in Mohri et al. [2012]).
Let $\mathcal{F}_1, \ldots, \mathcal{F}_l$ be $l$ hypothesis sets in $\mathbb{R}^{\mathcal{X}}$, $l \ge 1$, and let $\mathcal{G} = \{ \max\{h_1, \ldots, h_l\} \colon h_j \in \mathcal{F}_j,\, j \in [l] \}$. Then, for any sample $S$ of size $m$, the empirical Rademacher complexity of $\mathcal{G}$ can be upper bounded as follows:
\[ \widehat{\mathfrak{R}}_S(\mathcal{G}) \le \sum_{j=1}^{l} \widehat{\mathfrak{R}}_S(\mathcal{F}_j). \]
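As an illustrative check (the classes and sample below are invented), the bound can be verified exactly for tiny finite classes by enumerating all sign vectors and comparing both sides.

```python
import itertools

def erad(fs, zs):
    """Exact empirical Rademacher complexity of a finite class on `zs`."""
    m = len(zs)
    total = 0.0
    for sigma in itertools.product([-1, 1], repeat=m):
        total += max(sum(s * f(z) for s, z in zip(sigma, zs)) / m for f in fs)
    return total / 2 ** m

zs = [0.2, 0.5, 0.8]  # a hypothetical sample
F1 = [lambda z: z, lambda z: -z]
F2 = [lambda z: 1.0, lambda z: z * z]
# the max-composed class G = { max(h1, h2) : h1 in F1, h2 in F2 }
G = [lambda z, h1=h1, h2=h2: max(h1(z), h2(z)) for h1 in F1 for h2 in F2]
assert erad(G, zs) <= erad(F1, zs) + erad(F2, zs) + 1e-12
print("max-class bound holds on this example")
```

Because `erad` is exact for finite classes, the assertion is a direct instance of the lemma with $l = 2$.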